Abstract


Execution  Time  of  Symmetric  Eigensolvers 

by 

Kendall  Swenson  Stanley 
Doctor  of  Philosophy  in  Computer  Science 

University  of  California  at  Berkeley 
Professor James Demmel, Chair

The  execution  time  of  a  symmetric  eigendecomposition  depends  upon  the  application,  the 
algorithm,  the  implementation,  and  the  computer.  Symmetric  eigensolvers  are  used  in  a 
variety  of  applications,  and  the  requirements  of  the  eigensolver  vary  from  application  to 
application. Many different algorithms can be used to perform a symmetric eigendecomposition,
each with differing computational properties. Different implementations of the
same  algorithm  may  also  have  greatly  differing  computational  properties.  The  computer  on 
which  the  eigensolver  is  run  not  only  affects  execution  time  but  may  favor  certain  algorithms 
and  implementations  over  others. 

This thesis explains the performance of the ScaLAPACK symmetric eigensolver,
the  algorithms  that  it  uses,  and  other  important  algorithms  for  solving  the  symmetric  eigen- 
problem  on  today’s  fastest  computers.  We  offer  advice  on  how  to  pick  the  best  eigensolver 
for  particular  situations  and  propose  a  design  for  the  next  ScaLAPACK  symmetric  eigensolver 
which  will  offer  greater  flexibility  and  50%  better  performance. 


Professor  James  Demmel 
Dissertation  Committee  Chair 
Chapter 1

Summary  -  Interesting 
Observations 


The symmetric eigendecomposition of a real symmetric matrix is: $A = QDQ^T$,
where $D$ is diagonal and $Q$ is orthonormal, i.e. $Q^TQ = I$. Tridiagonal based methods
reduce $A$ to a tridiagonal matrix through an orthonormal similarity transformation, i.e.
$A = ZTZ^T$, compute the eigendecomposition of the tridiagonal matrix $T = UDU^T$ and,
if necessary, transform the eigenvectors of the tridiagonal matrix back into eigenvectors of
the original matrix $A$, i.e. $Q = ZU$. Non-tridiagonal based methods operate directly on
the original matrix $A$.
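To make the first step of a tridiagonal based method concrete, the following is a minimal pure-Python sketch of Householder reduction that produces $Z$ and $T$ with $A = ZTZ^T$. It is an illustration only: the function names (`matmul`, `transpose`, `tridiagonalize`) are hypothetical, it forms each reflector as an explicit dense matrix, and it is not the blocked algorithm that PDSYTRD uses. The remaining steps (solving $T = UDU^T$ and forming $Q = ZU$) are not shown.

```python
import math

def matmul(X, Y):
    """Dense matrix product of two lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def tridiagonalize(A):
    """Return (Z, T) with A = Z T Z^T, T tridiagonal and Z orthogonal.

    A must be a real symmetric matrix given as a list of lists.
    Each step applies a Householder reflector H (symmetric and
    orthogonal) as T <- H T H and accumulates Z <- Z H.
    """
    n = len(A)
    T = [row[:] for row in A]
    Z = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n - 2):
        # Part of column k below the subdiagonal position.
        x = [T[i][k] for i in range(k + 1, n)]
        norm = math.sqrt(sum(t * t for t in x))
        if norm == 0.0:
            continue  # column is already in tridiagonal form
        v = x[:]
        v[0] += math.copysign(norm, x[0])  # sign chosen to avoid cancellation
        vtv = sum(t * t for t in v)
        # H = I - 2 v v^T / (v^T v), acting on rows and columns k+1..n-1.
        H = [[float(i == j) for j in range(n)] for i in range(n)]
        for i in range(len(v)):
            for j in range(len(v)):
                H[k + 1 + i][k + 1 + j] -= 2.0 * v[i] * v[j] / vtv
        T = matmul(matmul(H, T), H)
        Z = matmul(Z, H)
    return Z, T
```

Forming $H$ explicitly costs far more than the $O(n^3)$ flops of a practical reduction; the sketch trades efficiency for readability so that the invariant $A = ZTZ^T$ is easy to check after every step.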

I am interested in understanding and minimizing the execution time of dense symmetric
eigensolvers, as used in real applications, on distributed memory parallel computers.
I have modeled the performance of symmetric eigensolvers as a function of the algorithm,
the application, the implementation and the computer. Some applications require only a
partial eigendecomposition, i.e. only a few eigenvalues or eigenvectors. Different
implementations may require different communication or computation patterns and they may use
different libraries and/or compilers. This thesis concentrates on the $O(n^3)$ cost of reduction
to tridiagonal form and transforming the eigenvectors back to the original space.

I  have  modeled  the  execution  time  of  the  ScaLAPACK[31]  symmetric  eigensolver, 
PDSYEVX,  in  detail  and  validated  this  model  against  actual  performance  on  a  number  of 
distributed  memory  parallel  computers.  PDSYEVX,  like  most  ScaLAPACK  codes,  uses  calls 
to the PBLAS[41, 140] to perform basic linear algebra operations such as matrix-matrix
multiply and matrix-vector multiply in parallel. PDSYEVX and the PBLAS use calls to the
Basic Linear Algebra Subroutines, BLAS[63, 62], to perform basic linear algebra operations
such as matrix-matrix multiply and matrix-vector multiply on data local to each processor,
and calls to the Basic Linear Algebra Communications Subroutines, BLACS[169, 69], to move
data between the processors. The level one BLAS involve only vectors and perform $O(n)$
flops on $O(n)$ data, where $n$ is the length of the vector. The level two BLAS involve one
matrix and one or two vectors and perform $O(n^2)$ flops on $O(n^2)$ data, where the matrix is
of size $n \times n$. The level three BLAS involve only matrices and perform $O(n^3)$ flops on
$O(n^2)$ data and offer the best opportunities to obtain peak floating point performance through
data re-use.
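The difference in data re-use between the three BLAS levels can be illustrated with a small calculation. The sketch below is an assumption-laden illustration, not a measurement: the function name `flops_per_word` is hypothetical, and the representative routines (DAXPY, DGEMV, DGEMM) and their standard flop and operand counts are chosen only to show how the ratio of flops to data grows with the level.

```python
def flops_per_word(level, n):
    """Approximate flops per word of operand data for one representative
    routine at each BLAS level, with vectors of length n and n-by-n matrices."""
    if level == 1:
        # DAXPY, y = a*x + y: 2n flops on the 2n words of x and y.
        return (2 * n) / (2 * n)
    if level == 2:
        # DGEMV, y = A*x + y: 2n^2 flops on n^2 + 2n words.
        return (2 * n * n) / (n * n + 2 * n)
    if level == 3:
        # DGEMM, C = A*B + C: 2n^3 flops on the 3n^2 words of A, B and C.
        return (2 * n ** 3) / (3 * n * n)
    raise ValueError("level must be 1, 2 or 3")

for level in (1, 2, 3):
    print(level, flops_per_word(level, 1000))
```

Only the level three ratio grows with $n$, which is why level three routines can amortize the cost of moving data through the memory hierarchy.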

PDSYEVX  uses  a  2D  block  cyclic  data  layout  for  all  input,  output  and  internal 
matrices. 2D block cyclic data layouts have been shown to support scalable high performance
parallel dense linear algebra codes[32, 30, 124] and hence have been selected as the
primary  data  layout  for  HPF[110],  ScaLAPACK[68]  and  other  parallel  dense  linear  algebra 
libraries[98,  164].  A  2D  block  cyclic  data  layout  is  defined  by  the  processor  grid  (pr  by  pc), 
the  local  block  size  (mb  by  nb)  and  the  location  of  the  (1,1)  element  of  the  matrix.  In 
this thesis, we will assume that the (1,1) element of matrix A, i.e. A(1,1), is mapped to
the (1,1) element of the local matrix in processor (0,0). Hence, A(i,j) is stored in element

$$\left( \left\lfloor \frac{i-1}{mb \times p_r} \right\rfloor mb + \mathrm{mod}(i-1,\, mb) + 1,\;\; \left\lfloor \frac{j-1}{nb \times p_c} \right\rfloor nb + \mathrm{mod}(j-1,\, nb) + 1 \right)$$

on processor

$$\left( \mathrm{mod}\left( \left\lfloor \frac{i-1}{mb} \right\rfloor,\, p_r \right),\;\; \mathrm{mod}\left( \left\lfloor \frac{j-1}{nb} \right\rfloor,\, p_c \right) \right).$$

Figures 1.1 and 1.2, reprinted from the ScaLAPACK User’s Guide[31], show how a 9 by 9 matrix
would be distributed over a 2 by 3 processor grid with mb = nb = 2. In general, we will assume
that square blocks are used since this is best for the symmetric eigenproblem, and we will use
nb to refer to both the row block size and the column block size.
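The 2D block cyclic mapping described above can be sketched in a few lines of Python. This is an illustration under the stated convention that A(1,1) lives in local element (1,1) of processor (0,0); the function name `block_cyclic_location` is hypothetical and is not a ScaLAPACK routine.

```python
def block_cyclic_location(i, j, mb, nb, pr, pc):
    """Map the 1-based global index (i, j) to its owner and local index.

    Returns ((proc_row, proc_col), (local_i, local_j)) for an mb-by-nb
    block size on a pr-by-pc processor grid. Processor coordinates are
    0-based, local indices are 1-based.
    """
    # Blocks are dealt out cyclically to processor rows and columns.
    proc = (((i - 1) // mb) % pr, ((j - 1) // nb) % pc)
    # Full local blocks that precede this one, plus the offset in the block.
    local_i = ((i - 1) // (mb * pr)) * mb + (i - 1) % mb + 1
    local_j = ((j - 1) // (nb * pc)) * nb + (j - 1) % nb + 1
    return proc, (local_i, local_j)

# The 9 by 9 example on a 2 by 3 grid with mb = nb = 2.
for (i, j) in [(1, 1), (3, 5), (9, 9)]:
    print((i, j), block_cyclic_location(i, j, 2, 2, 2, 3))
```

Each global element is owned by exactly one processor and occupies a distinct local position there, which is what allows every processor to store its share as an ordinary dense local matrix.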

All  ScaLAPACK  codes  including  PDSYEVX  in  version  1.5  use  the  data  layout  block 
size  as  the  algorithmic  blocking  factor.  Hence,  except  as  noted,  we  use  nb  to  refer  to 
the algorithmic blocking factor as well as the data layout block size. Data layouts and
algorithmic blocking factors are discussed in Section 2.3.3.

PDSYEVX  calls  the  following  routines: 

PDSYTRD  Performs  Householder  reduction  to  tridiagonal  form. 

PDSTEBZ  Computes  the  eigenvalues  of  a  tridiagonal  matrix  using  bisection. 

PDSTEIN  Computes the eigenvectors of the tridiagonal matrix using inverse iteration and
Gram-Schmidt reorthogonalization.

PDORMTR  Transforms the eigenvectors of the tridiagonal matrix back into eigenvectors of
the original matrix.

Figure 1.1: 9 by 9 matrix distributed over a 2 by 3 processor grid with mb = nb = 2

My performance models explain performance in terms of the following application parameters:

n  The matrix size.

m  The number of eigenvectors required.

e  The number of eigenvalues required ($e \ge m$).

the following machine parameters:

p  The number of processors (arranged in a pr by pc grid as described below).

$\alpha$  The communication latency (secs/message).

$\beta$  The inverse communication bandwidth (secs/double precision word). This means that
sending a message of $k$ double precision words costs $\alpha + k\beta$.

$\gamma_1, \gamma_2, \gamma_3$  Time per flop for BLAS1, BLAS2 and BLAS3 routines respectively.

$\delta_1, \delta_2, \delta_3, \delta_4$  Software overhead for BLAS1, BLAS2, BLAS3 and PBLAS routines respectively.
This means that a call to DGEMM (a BLAS3 routine) requiring $c$ flops costs $\delta_3 + c\gamma_3$. See
Chapter 3 for details on the cost of the BLAS. The cost of the PBLAS routine PDSYMV
is shown in Table 4.3.

Figure 1.2: Processor point of view for 9 by 9 matrix distributed over a 2 by 3 processor
grid with mb = nb = 2

My  model  also  uses  the  following  algorithmic  and  data  layout  parameters: 

pr  The  number  of  processor  rows  in  the  processor  grid. 

pc  The  number  of  processor  columns  in  the  processor  grid. 

nb  The data layout block size and algorithmic blocking factor.

These and all other variables used in this thesis are listed in Table A.1 in Appendix A.
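The two basic cost expressions in the parameter list, a latency plus a per-word cost for a message and a software overhead plus a per-flop cost for a BLAS call, can be written as small Python functions. This is a sketch only: the function names are hypothetical and the numeric machine parameters in the usage lines are invented placeholders, not measurements from any computer discussed in this thesis.

```python
def message_cost(k, alpha, beta):
    """Time to send one message of k double precision words:
    latency alpha (secs/message) plus k times the inverse bandwidth beta."""
    return alpha + k * beta

def blas3_call_cost(c, delta3, gamma3):
    """Time for one BLAS3 call (for example DGEMM) performing c flops:
    software overhead delta3 plus c times the BLAS3 time per flop gamma3."""
    return delta3 + c * gamma3

def dgemm_flops(n):
    """Flop count of an n-by-n by n-by-n matrix-matrix multiply."""
    return 2 * n ** 3

# Placeholder machine parameters (assumptions for illustration only).
alpha, beta = 50e-6, 1e-7      # secs/message, secs/word
delta3, gamma3 = 1e-5, 2e-8    # secs/call, secs/flop

print(message_cost(1000, alpha, beta))
print(blas3_call_cost(dgemm_flops(100), delta3, gamma3))
```

For short messages and small BLAS calls the fixed terms dominate, which is the reason the per-call overheads appear as separate parameters rather than being folded into the per-word or per-flop rates.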

The rest of this chapter presents the most interesting results from my study of the
execution time of symmetric eigensolvers on distributed memory computers. Section 1.1
describes the algorithms commonly used for dense symmetric eigendecomposition on
distributed memory parallel computers. Section 1.2 describes how software overhead and load
imbalance costs are significant. Section 1.3 explains the two rules of thumb for ensuring
that a distributed memory parallel computer can achieve good performance on a dense linear
algebra code such as ScaLAPACK’s symmetric eigensolver. Section 1.4 explains that it
is important to identify which techniques offer the greatest potential for improving
performance across a wide range of applications, problem sizes and distributed memory
parallel computers. Section 1.5 gives a synopsis of how execution time of the ScaLAPACK
symmetric eigensolver could be reduced. Section 1.6 explains the types of applications on
which Jacobi can be expected to be as fast as, or faster than, tridiagonal based methods.

The rest of my thesis is organized as follows. Chapter 2 provides an introduction
and a historical perspective. Chapter 3 explains the performance of the Basic Linear Algebra
Subroutines (BLAS). Chapter 4 contains my complete execution time model for ScaLAPACK’s
symmetric eigensolver, PDSYEVX. Chapter 5 simplifies the execution time model by
concentrating on a particular application on a particular distributed memory parallel computer,
the Intel Paragon. Chapter 6 explains the performance requirements of distributed memory
parallel computers and discusses the execution time of PDSYEVX. Chapter 7 explains the
performance of other dense symmetric eigensolvers. Chapter 8 provides a blueprint for
reducing the execution time of PDSYEVX. Chapter 9 offers concise advice to users of symmetric
eigensolvers.

1.1  Algorithms 

There are many widely disparate symmetric eigendecomposition algorithms. Tridiagonal
reduction based algorithms for the symmetric eigendecomposition require asymptotically
the fewest flops and have been historically the fastest and most popular[83, 79, 129,
153, 86, 145, 134, 50].

Iterative  eigensolvers,  e.g.  Lanczos  and  conjugate  gradient  methods,  are  clearly 
superior  if  the  input  matrix  is  sparse  and  only  a  limited  portion  of  the  spectrum  is  needed[49, 
119].  Iterative  eigensolvers  are  out  of  the  scope  of  this  thesis. 

Even for tridiagonal matrices, there are several algorithms worthy of attention
for the tridiagonal eigendecomposition. The ideal method would require at most $O(n^2)$
floating point operations, $O(n)$ message volume and $O(p)$ messages. The recent work of
Parlett and Dhillon[136, 139] renews hope that such a method will be available in the
near future. Should this effort hit unexpected snags, other better known methods, such
as QR[79, 86, 93], QD[135], bisection and inverse iteration[83, 102] and Cuppen’s divide
and conquer algorithm[50, 66, 147, 88] will remain common. Parallel codes have been
written for QR[39, 8, 76, 125], bisection and inverse iteration[15, 75, 54, 81] and Cuppen’s
algorithm[82, 80, 141]. ScaLAPACK offers parallel QR and parallel bisection and inverse
iteration codes. Cuppen’s algorithm[50, 66, 88], which has recently replaced QR as
the fastest serial method[147], has been coded for inclusion in ScaLAPACK by Françoise
Tisseur. Algorithms for the tridiagonal eigenproblem are discussed in Section 2.2, and
parallel tridiagonal eigensolvers are discussed in Section 7.1.

A detailed comparison of tridiagonal eigensolvers would be premature until Parlett
and Dhillon complete their prototype.

This thesis concentrates on the $O(n^3)$ cost of reduction to tridiagonal form and
transforming the eigenvectors back to the original space. Hendrickson, Jessup and Smith[91]
showed that reduction to tridiagonal form can be performed 50% faster than ScaLAPACK
does. Lang’s successive band reduction[116], SBR, is interesting at least if only eigenvalues
are to be computed. But the complexity of SBR has made it difficult to realize the
theoretical advantages of SBR in practice. A performance model for PDSYEVX, ScaLAPACK’s
symmetric eigensolver (Section 7.1.2), is given in Chapter 4. By restricting our attention to
a single computer, and to the most common applications, the model is further simplified
and discussed in Chapter 5.

Jacobi requires 4-20 times as many floating point operations as tridiagonal based
methods, hence the type of problems on which Jacobi will be faster will always be limited.
Jacobi is faster than tridiagonal based methods[125, 2] on small spectrally diagonally
dominant matrices¹ despite requiring 4 times as many flops because it has less overhead.
However, on large problems tridiagonal based methods can achieve at least 25% efficiency
and will hence be faster than any method requiring 4 times as many flops. And, on matrices
that are not spectrally diagonally dominant, Jacobi requires 20 or more times as many flops
as tridiagonal based methods, a handicap that is simply too large to overcome. Jacobi’s
method is discussed in Section 7.3.
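The flop-count argument above reduces to a one-line comparison, sketched here in Python. The model is deliberately crude and is an assumption of this sketch rather than a model taken from later chapters: execution time is taken to be proportional to flops divided by the fraction of peak achieved, and the function name and the efficiency values other than 25% are hypothetical.

```python
def jacobi_can_win(flop_ratio, tridiagonal_efficiency, jacobi_efficiency=1.0):
    """Return True if Jacobi could finish before a tridiagonal based method.

    flop_ratio is how many times more flops Jacobi needs (4 to 20 in the
    text). Efficiencies are fractions of peak floating point rate; the
    default of 1.0 gives Jacobi the benefit of running at peak.
    """
    jacobi_time = flop_ratio / jacobi_efficiency   # in units of (flops / peak)
    tridiagonal_time = 1.0 / tridiagonal_efficiency
    return jacobi_time < tridiagonal_time

print(jacobi_can_win(4, 0.25))    # large problem, spectrally diagonally dominant
print(jacobi_can_win(4, 0.10))    # small problem where overhead lowers efficiency
print(jacobi_can_win(20, 0.10))   # not spectrally diagonally dominant
```

Even when Jacobi is assumed to run at peak, it can only win where the tridiagonal based method falls below the reciprocal of the flop ratio, which is why the favorable cases are confined to small problems.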

Methods that require multiple n by n matrix-matrix multiplies, such as the Invariant
Subspace Decomposition Approach[97] (ISDA), and Yan and Lu’s FFT based method[174]
require roughly 30 times as many floating point operations as tridiagonal based methods
and hence may never be faster than tridiagonal based methods. The ISDA for solving
symmetric eigenproblems is discussed in Section 7.4.

¹ Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof,
is diagonally dominant.

Banded ISDA[26] is an improvement on ISDA that involves an initial bandwidth reduction.
Banded ISDA[26] is nearly a tridiagonal method and offers performance that is nearly as
good, at least if only eigenvalues are sought. However, since a banded ISDA code requires
multiple bandwidth reductions, each of which requires a back transformation, if even a few
eigenvectors are required, a banded ISDA code must either store the back transformations
in compact form or perform an additional $O(n^3)$ flops. No code available today stores and
applies these back transformations in compact form. At present, the fastest banded ISDA
code starts by reducing the matrix to tridiagonal form and is neither the fastest tridiagonal
eigensolver, nor the easiest to parallelize.
Banded ISDA is discussed in Section 7.5.

In conclusion, reduction to tridiagonal form combined with Parlett and Dhillon’s
tridiagonal eigensolver is likely to be the preferred method for eigensolution of dense matrices
for most applications.

In the meantime, until Parlett and Dhillon’s code is available, we believe that
PDSYEVX is the best general purpose symmetric eigensolver for dense matrices. It is available
on any machine to which ScaLAPACK has been ported², it achieves 50% efficiency even
when the flops in the tridiagonal eigensolution are not counted³ and it scales well, running
efficiently on machines with thousands of nodes. It is faster than ISDA and faster than
Jacobi on large matrices and on matrices that are not spectrally diagonally dominant.

1.2  Software overhead and load imbalance costs are significant

In PDSYEVX, it is somewhat surprising but true that software overhead and load
imbalance costs are larger than communications costs. In its broadest definition, software
overhead is the difference between the actual execution time and the cost of communication
and computation. Software overhead includes saving and restoring registers, parameter
passing, error and special case checking as well as those tasks which prevent calls to the
BLAS involving few flops from being as efficient as calls to the BLAS involving many flops:
loop overhead, border cases and data movement between memory hierarchies that gets
amortized over all the operations in a given call to the BLAS. The cost of any operation
which is performed by only a few of the processors (while the other processors are idle) is
a load imbalance cost.

² Intel Paragon, Cray T3D, Cray T3E, IBM SP2, and any machine supporting the BLACS, MPI or PVM.

³ Our definition of efficiency is a demanding one: total time divided by the time required by reduction
to tridiagonal form and back transformation assuming that these are performed at the peak floating point
execution rate of the machine, i.e. $\mathrm{time} / \left( \frac{10}{3} \frac{n^3}{p} \gamma_3 \right)$.

Because  software  overhead  is  as  significant  as  communication  latency,  the  three 
term  performance  model  introduced  by  Choi  et  al.[40]  and  used  in  my  earlier  work[57], 
which  only  counts  flops,  number  of  messages  and  words  communicated,  does  not  adequately 
model the performance of PDSYEVX. In addition to these three terms, a fourth term, which
we designate $\delta$, representing software overhead costs is required.

Software  overhead  is  more  difficult  to  measure,  study,  model  and  reason  about 
than  the  other  components  of  execution  time.  Measuring  the  execution  time  required  for  a 
subroutine  call  requiring  little  or  no  work  measures  only  subroutine  call  overhead,  parameter 
passing  and  error  checking.  For  the  performance  models  in  this  thesis,  we  measure  the 
execution  time  of  each  routine  across  a  range  of  problem  sizes  (with  code  cached  and  data 
not  cached)  and  use  curve  fitting  to  estimate  the  software  overhead  of  an  individual  routine. 
Because  we  perform  these  timings  with  code  cached  but  data  not  cached,  this  gives  an 
estimate  of  all  software  overhead  costs  except  code  cache  misses. 
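The curve fitting step described above can be sketched as an ordinary least-squares fit of a straight line, where the intercept estimates the per-call software overhead and the slope estimates the time per flop. This is a minimal illustration, not the fitting procedure actually used for the models in this thesis: the function name `fit_overhead` is hypothetical, and the timings in the usage lines are synthetic values generated from assumed parameters rather than measurements.

```python
def fit_overhead(flops, times):
    """Least-squares fit of times ~ delta + gamma * flops.

    flops and times are equal-length sequences, one entry per problem
    size timed. Returns (delta, gamma): the estimated software overhead
    per call and the estimated time per flop.
    """
    count = len(flops)
    mean_f = sum(flops) / count
    mean_t = sum(times) / count
    sxx = sum((f - mean_f) ** 2 for f in flops)
    sxy = sum((f - mean_f) * (t - mean_t) for f, t in zip(flops, times))
    gamma = sxy / sxx
    delta = mean_t - gamma * mean_f
    return delta, gamma

# Synthetic "timings" built from assumed values delta = 1e-5 and gamma = 2e-8.
flops = [2.0 * n ** 3 for n in (8, 16, 32, 64)]
times = [1e-5 + 2e-8 * f for f in flops]
print(fit_overhead(flops, times))
```

Timing a range of problem sizes and extrapolating to zero flops captures the overheads that scale with neither flops nor data, which a single timing of an empty call would miss.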

We use times with the code cached and data not cached for our performance models because,
for most problem sizes, the matrix is too large to fit in cache but it is less clear whether code
fits in cache or not. It is easy to compute the amount of data which must be cached, but
there is no portable automatic way to measure the amount of code which must be cached.
Furthermore, the data cache needs, for typical problem sizes, are much larger than code
cache needs; hence, while it is usually clear that the data is not cached, the code cache needs
and code cache size are much closer.

A  full  study  of  software  overhead  costs  is  out  of  the  scope  of  this  thesis  and  remains 
a  topic  for  future  research.  The  overhead  and  load  imbalance  terms  in  the  performance 
model  for  PDSYEVX  on  the  PARAGON  are  explained  in  Sections  5.7  and  5.8. 
1.3  Effect of machine performance characteristics on PDSYEVX


The most important machine performance characteristic is the peak floating point
rate. Bisection bandwidth essentially defines which machines ScaLAPACK can perform well
on. Message latency and software overhead, since they are $O(n)$ terms, are important
primarily for small and medium matrices.

Most collections of computers fall into one of two groups: those connected by
a switched network whose bisection bandwidth increases linearly (or nearly so) with the
number of processors and those connected by a network that only allows one processor to
send at a time. All current distributed memory parallel computers that I am aware of have
adequate bisection bandwidth⁴ to support good efficiency on PDSYEVX. On the other hand,
no network that only allows one processor to send at a time can allow scalable performance
and none that I am aware of allows good performance with as many as 16 processors. As
long as the bandwidth rule of thumb (explained in detail in Section 6.1.1) holds, bandwidth
will not be the limiting factor in the performance of PDSYEVX.

Bandwidth rule of thumb: Bisection bandwidth per processor⁵ times the square root
of memory size per processor should exceed floating point performance per processor.

$$\frac{\text{Megabytes/sec}}{\text{processor}} \times \sqrt{\frac{\text{Megabytes}}{\text{processor}}} > \frac{\text{Megaflops/sec}}{\text{processor}}$$

assures that bandwidth will not limit performance.


Assuming that the bandwidth is adequate, we consider next the problem size per
processor:

If the problem is large enough, i.e. $(n^2/p) > 2 \times (\text{Megaflops/processor})$, then
PDSYEVX should execute reasonably efficiently. This rule (explained in detail in Section 6.1.2)
can be restated as:

Memory size rule of thumb: memory size should match floating point performance

$$\frac{\text{Megabytes}}{\text{processor}} > \frac{\text{Megaflops/sec}}{\text{processor}}$$

assures that PDSYEVX will be efficient on large problems.

⁴ Few distributed memory parallel computers offer bandwidth that scales linearly with the number of
processors but most still have adequate bisection bandwidth.

⁵ Bisection bandwidth per processor is the total bisection bandwidth of the network divided by the number
of processors.
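Both rules of thumb are simple enough to check mechanically for a candidate machine. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the memory size rule is read as "megabytes per processor should exceed megaflops per processor" (one interpretation of "match"), and the machine figures in the usage lines are invented placeholders rather than data for any real computer.

```python
import math

def bandwidth_rule_ok(bisection_mb_per_sec_per_proc, memory_mb_per_proc,
                      mflops_per_proc):
    """Bandwidth rule of thumb: bisection bandwidth per processor times the
    square root of memory size per processor should exceed floating point
    performance per processor."""
    lhs = bisection_mb_per_sec_per_proc * math.sqrt(memory_mb_per_proc)
    return lhs > mflops_per_proc

def memory_rule_ok(memory_mb_per_proc, mflops_per_proc):
    """Memory size rule of thumb, read here (an assumption) as: megabytes
    per processor should exceed megaflops per processor."""
    return memory_mb_per_proc > mflops_per_proc

# Placeholder machine: 25 MB/s of bisection bandwidth, 64 MB and
# 50 Mflops per processor (invented numbers for illustration).
print(bandwidth_rule_ok(25.0, 64.0, 50.0))
print(memory_rule_ok(64.0, 50.0))
```

The square root in the bandwidth rule reflects that a larger memory permits larger problems, whose communication volume grows more slowly than their flop count.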

If the problem is not large enough, lower order terms, as explained in Chapter 4,
will be significant. Unlike the peak flop rate, which can be substantially independent
of main memory performance, lower order terms (communication latency, communication
bandwidth, software overhead and load imbalance) are strongly linked to main memory
performance.

PDSYEVX can work well on machines with large slow main memory (on large problems)
and on machines with small fast main memory (on small problems). Most distributed
memory  parallel  computers  have  sufficient  memory  size  and  network  bisection  bandwidth 
to  allow  PDSYEVX  to  achieve  high  efficiency  on  large  problem  sizes.  The  Cray  T3E  is  one  of 
the  few  machines  that  has  sufficient  main  memory  performance  to  allow  PDSYEVX  to  achieve 
high  performance  on  small  problem  sizes.  The  effect  of  machine  performance  characteristics 
on  PDSYEVX  is  discussed  in  Chapter  6. 

1.4  Prioritizing techniques for improving performance

One of the most important uses of performance modeling is to identify which
techniques  offer  the  most  promise  for  performance  improvement,  because  there  are  too 
many  performance  improvement  techniques  to  allow  one  to  try  them  all.  One  technique 
that  appeared  to  be  important  early  in  my  work,  optimizing  global  communications,  now 
appears  less  important  in  light  of  the  discovery  that  software  overhead  and  load  imbalance 
are  more  significant  than  earlier  thought.  Here  we  talk  about  general  conclusions;  details 
are  summarized  in  Section  1.5,  and  elaborated  in  Chapters  7  and  8. 

Overlapping communication and computation, though it undeniably increases performance,
should be implemented only after every effort has been made to reduce both
communications  and  computations  costs  as  much  as  possible.  Overlapping  communication 
and  computation  has  proven  to  be  more  attractive  in  theory  than  in  practice  because  not 
all  communication  costs  overlap  well  and  communication  costs  are  not  the  only  impediment 
to  good  parallel  performance. 
Although Strassen’s matrix multiplication has been proven to offer performance
better than can be achieved through traditional methods, it will be a long time before
a Strassen’s matrix multiply is shown to be twice as fast as a traditional method. A
typical single processor computer would require 2-4 Gigabytes of main memory to achieve
an effective flop rate of twice the machine’s peak flop rate⁶ and 2-4 Terabytes of main
memory to achieve 4 times the peak flop rate. Strassen’s matrix multiplication will get
increasing use in the coming years, because achieving 20% above “peak” performance is
nothing to sneeze at, but Strassen’s matrix multiply will not soon make matrix multiply
based eigendecomposition such as ISDA faster than tridiagonal based eigendecomposition.
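A rough feel for why such large memories are needed comes from an idealized model of Strassen’s method, sketched below in Python. The model is an assumption of this sketch and not necessarily the derivation behind the figures quoted above: each level of recursion replaces 8 block multiplies by 7, the extra matrix additions are ignored, and recursion is assumed to stop at a hypothetical cutoff size of 256, below which a traditional multiply runs at peak. The function names are likewise hypothetical.

```python
import math

def strassen_levels_needed(target_speedup):
    """Smallest number of recursion levels k with (8/7)**k >= target_speedup,
    ignoring the cost of the extra matrix additions."""
    return math.ceil(math.log(target_speedup) / math.log(8.0 / 7.0))

def matrix_bytes(levels, cutoff_n=256, bytes_per_word=8):
    """Storage for one double precision n-by-n matrix, where
    n = cutoff_n * 2**levels is the size needed for that many levels."""
    n = cutoff_n * 2 ** levels
    return bytes_per_word * n * n

for speedup in (1.2, 2.0, 4.0):
    levels = strassen_levels_needed(speedup)
    print(speedup, levels, matrix_bytes(levels))
```

Under these assumptions, doubling the effective flop rate takes 6 levels and a single operand of about 2 gigabytes, while a factor of 4 takes 11 levels and about 2 terabytes, the same orders of magnitude as the memory sizes cited in the text.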

1.5  Reducing  the  execution  time  of  symmetric  eigensolvers 

PDSYEVX can be improved. It does not work well on matrices with large clusters
of eigenvalues. And, it is not as efficient as it could be[91], achieving only 50% of peak
efficiency on PARAGON, Cray T3D and Berkeley NOW even on large matrices. On small
matrices it performs worse. Parlett and Dhillon’s new tridiagonal eigensolver promises to
solve the clustered eigenvalue problem so we concentrate on improving the performance of
reduction to tridiagonal form and back transformation.

Input and output data layout need not affect execution time of a parallel symmetric
eigensolver because data redistribution is cheap. Data redistribution requires only
$O(p)$ messages and $O(n^2/p)$ message volume per processor. This is modest compared to the
$O(n \log(p))$ messages and $O(n^2/\sqrt{p})$ message volume per processor required by reduction
to tridiagonal form and back transformation.
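The asymptotic comparison above can be made tangible by evaluating the leading terms for a sample problem. The Python sketch below is illustrative only: the function names are hypothetical, all hidden constants are taken to be 1, and the logarithm is assumed to be base 2, so the printed numbers indicate relative growth rather than actual message counts.

```python
import math

def redistribution_cost(n, p):
    """Leading-order (messages, words) per processor for data
    redistribution: O(p) messages and O(n^2 / p) words."""
    return p, n * n / p

def reduction_cost(n, p):
    """Leading-order (messages, words) per processor for reduction to
    tridiagonal form and back transformation: O(n log p) messages and
    O(n^2 / sqrt(p)) words."""
    return n * math.log2(p), n * n / math.sqrt(p)

n, p = 4096, 64
print(redistribution_cost(n, p))
print(reduction_cost(n, p))
```

For this sample size, redistribution needs a few hundred times fewer messages and several times less message volume than the reduction itself, so changing the data layout before and after the computation adds little to the total.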

Separating  internal  and  external  data  layout  actually  decreases  minimum  execution 
time  over  all  data  layouts.  Separating  internal  and  external  data  layouts  allows  reduction 
to  tridiagonal  form  and  back  transformation  to  use  different  data  layouts.  It  also  allows 
codes  to  concentrate  only  on  the  best  data  layout,  reducing  software  overhead  and  allowing 
improvements  which  would  be  prohibitively  complicated  to  implement  if  they  had  to  work 
on  all  two-dimensional  block  cyclic  data  layouts. 

Separating  internal  and  external  data  layouts  increases  the  minimum  workspace 
requirement'  from  2.5  n2  to  3  n2.  However  with  minor  improvements  in  the  existing  code, 

6  A  dual  processor  computer  would  require  twice  as  much  memory. 

1  Assuming  that  data  redistribution  is  not  performed  in  place.  It  is  difficult  to  redistribute  data  in  place 
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and  without  any  changes  to  the  interface,  internal  and  external  data  layout  can  be  separated 
without  increasing  the  workspace  requirement.  See  Section  8.5. 

Lichtenstein  and  Jolinson[124]  point  out  that  data  layout  is  irrelevant  to  many 
linear  algebra  problems  because  one  can  solve  a  permuted  problem  instead  of  the  original. 
This  works  for  symmetric  problems  provided  that  the  input  data  is  distributed  over  a  square 
processor  grid  and  with  a  row  block  size  is  equal  to  the  column  block  size. 

Hendrickson,  Jessup  and  Smitli[91]  demonstrated  that  the  performance  of  PDSYEVX 
can  be  improved  substantially  by  reducing  load  imbalance,  software  overhead  and  commu¬ 
nications  costs.  Most  of  the  inefficiency  in  PDSYEVX  is  in  reduction  to  tridiagonal  form. 
Software  overhead  and  load  imbalance  are  responsible  for  more  of  the  inefficiency  than  the 
cost  of  communications.  Hence,  it  is  those  areas  that  need  to  be  sped  up  the  most.  Pre¬ 
liminary  results[91]  indicate  that  by  abandoning  the  PBLAS  interface,  using  BLAS  and  BLACS 
calls  directly,  and  concentrating  on  the  most  efficient  data  layout,  software  overhead,  load 
imbalance  and  communications  costs  can  be  cut  in  half.  Strazdins  has  investigated  reducing 
software  overheads  in  the  PBLAS[161],  but  it  remains  to  be  seen  whether  software  overheads 
in  the  PBLAS  can  be  reduced  sufficiently  to  allow  PDSYEVX  to  be  as  efficient  as  it  could  be. 
PDSYEVX  performance  can  be  improved  further  if  the  compiler  can  produce  efficient  code  on 
simple  doubly  nested  loops,  implementing  merged  BLAS  Level  2  operations  (like  DSYMV  and 
dsyr2). 

For  small  matrices,  software  overhead  dominates  all  costs,  and  lienee  one  should 
minimize  software  overhead  even  at  the  expense  of  increasing  the  cost  per  flop.  An  unblocked 
code  has  the  potential  to  do  just  that. 

Although  back  transformation  is  more  efficient  than  reduction  to  tridiagonal  form, 
it  can  be  improved.  Whereas  software  overhead  is  the  largest  source  of  inefficiency  in  re¬ 
duction  to  tridiagonal  form,  communications  cost  and  load  imbalance  are  the  largest  source 
of  inefficiency  in  back  transformation.  Load  imbalance  is  hard  to  eliminate  in  a  blocked 
data  layout  in  reduction  to  tridiagonal  form  because  the  size  of  the  matrix  being  updated 
is  constantly  changing  (getting  smaller),  but  in  back  transformation,  all  eigenvectors  are 
constantly  updated,  so  statically  balancing  the  number  of  eigenvalues  assigned  to  each 
processor  works  well.  Therefore  the  best  data  layout  for  back  transformation  is  a  two- 
dimensional  rectangular  block-cyclic  data  layout.  The  number  of  processor  columns,  pc , 


between  two  arbitrary  parallel  data  layouts.  If  efficient  in-place  data  redistribution  were  feasible,  separating 
internal  and  external  data  layout  would  require  only  a  trivial  increase  in  workspace. 
should exceed the number of processor rows by a factor of approximately 8. The optimal
data layout column block size is $\lceil n/(p_c k) \rceil$ for some small integer k. The row block size
is less important in back transformation, and 32 is a reasonable choice, although setting it
to the same value as the column block size will also work well if the BLAS are efficient on
that block size and $p_r < p_c$. Many techniques used to improve performance in LU
decomposition, such as overlapping communication and computation, pipelining communication
and asynchronous message passing, can also be used to improve the performance of back
transformation. Of these techniques, only asynchronous message passing (which eliminates
all local memory movement) requires modification to the BLACS interface. The modification
to the BLACS needed to support asynchronous message passing would allow forward and
backward compatibility.
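
The layout guidance above can be turned into a small calculation. The following is a minimal sketch and not part of ScaLAPACK: the helper names grid_shape and column_block_size are hypothetical, and measuring "approximately 8" on a logarithmic scale over the divisor pairs of p is an assumption of this sketch.

```python
import math


def grid_shape(p, target_ratio=8.0):
    """Pick (pr, pc) with pr * pc == p and pc / pr as close to target_ratio as possible.

    Closeness is measured on a log scale (an assumption of this sketch).
    """
    best = None
    for pr in range(1, p + 1):
        if p % pr:
            continue  # pr must divide p to form a full rectangular grid
        pc = p // pr
        dist = abs(math.log(pc / pr) - math.log(target_ratio))
        if best is None or dist < best[0]:
            best = (dist, pr, pc)
    return best[1], best[2]


def column_block_size(n, pc, k=1):
    """Column block size ceil(n / (pc * k)) for a small integer k."""
    return -(-n // (pc * k))  # integer ceiling division
```

For 128 processors this picks a 4 by 32 grid, and for n = 4000 with k = 4 the column block size is 32, which happens to match the row block size suggested in the text.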

All  of  these  methods  are  discussed  in  Chapter  8. 

1.6  Jacobi 

A one-sided Jacobi method with a two-dimensional data layout will beat tridiagonal
based eigensolvers on small spectrally diagonally dominant matrices. The simpler
one-dimensional data layout is sufficient for modest numbers of processors, perhaps as many
as a few hundred, but does not scale well. Tridiagonal based methods, because they require
fewer flops, will beat Jacobi methods on random matrices regardless of their size, and on large
($n > 200\sqrt{p}$) matrices even if they are spectrally diagonally dominant. Jacobi also remains
of interest in some cases when high accuracy is desired[58]. Jacobi’s method is discussed in
Section 7.3.

1.7  Where  to  obtain  this  thesis 


This  thesis  is  available  at:  http://www.cs.berkeley.edu/stanley/thesis 
Chapter 2

Overview  of  the  design  space 

2.1  Motivation 

The execution time of any computational solution to a problem is a single-valued
function (time) on a multi-dimensional and non-uniform domain. This domain includes
the  problem  being  solved,  the  algorithm,  the  implementation  of  the  algorithm  and  the 
underlying  hardware  and  software  (sometimes  referred  to  collectively  as  the  computer).  By 
studying  one  problem,  the  symmetric  eigenproblem,  in  detail  we  gain  insight  into  how  each 
of  these  factors  affects  execution  time. 

Section 2.2 discusses the most important algorithms for dense symmetric
eigendecomposition on distributed memory parallel computers. Section 2.3 discusses the effect that
the implementation can have on execution time. Section 2.4 discusses the effect of various
hardware characteristics on execution time. Section 2.5 lists several applications that use
symmetric eigendecomposition and their differing needs. Section 2.6 discusses the direct and
indirect  effects  of  machine  load  on  the  execution  time  of  a  parallel  code.  Section  2.7  outlines 
the  most  important  historical  developments  in  parallel  symmetric  eigendecomposition. 

2.2  Algorithms 

The most common symmetric eigensolvers which compute the entire eigendecomposition
use Householder reduction to tridiagonal form, form the eigendecomposition of the
tridiagonal matrix and transform the eigenvectors back to the original basis. Algorithms
that do not begin by reduction to tridiagonal form require more floating point operations.
Except for small spectrally diagonally dominant matrices, on which Jacobi will likely be
faster than tridiagonal based methods, and scaled diagonally dominant matrices on which
Jacobi is more accurate[58], tridiagonal based codes will be best for the eigensolution of
dense symmetric matrices. See Section 7.3 for details.

The recent work of Parlett and Dhillon offers the promise of computing the tridiagonal
eigendecomposition with $O(n^2)$ flops and $O(p)$ messages. Should some unexpected
hitch prevent this from being satisfactory on some matrix types, there are several other
algorithms from which to choose. Experience with existing implementations shows that for
most matrices of size 2000 by 2000 or larger, the tridiagonal eigendecomposition is a modest
component of total execution time.

Reduction to tridiagonal form and back transformation are the most time consuming
steps in the symmetric eigendecomposition of dense matrices. These two steps
require more flops ($O(n^3)$ vs. $O(n^2)$), more message volume ($O(n^2 \sqrt{p})$ vs. $O(n^2)$) and
more messages ($O(n \log(p))$ vs. $O(p)$) than the eigendecomposition of the tridiagonal
matrix. Since the cost of the matrix transformations (reduction to tridiagonal form and back
transformation) grows faster than the cost of tridiagonal eigendecomposition, the matrix
transformations are the dominant cost for larger matrices.
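
The growth rates above can be made concrete with a rough flop model. This is a sketch under stated assumptions: the leading-order counts of 4n^3/3 flops for Householder reduction and 2n^3 flops for back transformation of all n eigenvectors are the standard textbook figures, while tridiag_const is a hypothetical placeholder for the constant hidden in the $O(n^2)$ tridiagonal cost. The model counts flops only and ignores communication and software overhead.

```python
def flop_estimates(n, tridiag_const=10.0):
    """Leading-order flop counts for the three phases of a dense symmetric eigensolver."""
    reduction = 4.0 * n**3 / 3.0          # Householder reduction to tridiagonal form
    back_transformation = 2.0 * n**3      # applying the reflectors to n eigenvectors
    tridiagonal = tridiag_const * n**2    # hypothetical constant for the O(n^2) phase
    return reduction, back_transformation, tridiagonal


def tridiagonal_flop_share(n, tridiag_const=10.0):
    """Fraction of all flops spent in the tridiagonal eigendecomposition."""
    reduction, back, tridiagonal = flop_estimates(n, tridiag_const)
    return tridiagonal / (reduction + back + tridiagonal)
```

Whatever constant is chosen, the share shrinks in proportion to 1/n, which is the sense in which the matrix transformations dominate for larger matrices.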

Reduction to tridiagonal form and back transformation require different communication
patterns. Reduction to tridiagonal form is a two-sided transformation requiring
multiplication by Householder reflectors from both the left and right side. Two-sided reductions
require that every element in the trailing matrix be read for each column eliminated,
hence half of the flops are BLAS2 matrix-vector flops and $O(n \log(p))$ messages are required.
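
To see why every column elimination touches the whole trailing matrix, consider the following serial, unblocked sketch of Householder tridiagonalization. It is an illustration in plain Python, not the ScaLAPACK routine PDSYTRD; the function name and the list-of-lists matrix representation are assumptions made for the example. The matrix-vector product that forms p is the BLAS2 step: it reads the entire trailing matrix once per column.

```python
import math


def householder_tridiagonalize(A):
    """Return (T, Q) with A = Q T Q^T, T symmetric tridiagonal and Q orthogonal."""
    n = len(A)
    T = [[float(x) for x in row] for row in A]
    Q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n - 2):
        m = n - j - 1  # order of the trailing matrix
        x = [T[j + 1 + i][j] for i in range(m)]
        alpha = -math.copysign(math.sqrt(sum(t * t for t in x)), x[0])
        v = x[:]
        v[0] -= alpha
        norm_v = math.sqrt(sum(t * t for t in v))
        if norm_v == 0.0:
            continue  # column is already zero below the diagonal
        v = [t / norm_v for t in v]
        # BLAS2 step: p = 2 * A22 * v reads every element of the trailing matrix.
        p = [2.0 * sum(T[j + 1 + r][j + 1 + c] * v[c] for c in range(m))
             for r in range(m)]
        vp = sum(a * b for a, b in zip(v, p))
        w = [p[r] - vp * v[r] for r in range(m)]
        # Two-sided update H * A22 * H written as a symmetric rank-2 update.
        for r in range(m):
            for c in range(m):
                T[j + 1 + r][j + 1 + c] -= v[r] * w[c] + w[r] * v[c]
        # The eliminated column and row become (alpha, 0, ..., 0).
        for i in range(m):
            T[j + 1 + i][j] = T[j][j + 1 + i] = 0.0
        T[j + 1][j] = T[j][j + 1] = alpha
        # Accumulate Q <- Q * H, with H = I - 2 v v^T acting on columns j+1..n-1.
        for r in range(n):
            s = 2.0 * sum(Q[r][j + 1 + c] * v[c] for c in range(m))
            for c in range(m):
                Q[r][j + 1 + c] -= s * v[c]
    return T, Q
```

In a parallel code the vector p depends on data spread over the whole processor grid, so forming it requires communication for every column, which is where the $O(n \log(p))$ message count comes from.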

Equally  importantly,  two-sided  reductions  require  significant  calculations  within 
the  inner  loop,  which  translates  into  large  software  overhead.  Indeed  on  the  computers 
that  we  considered,  software  overhead  appears  to  be  a  larger  factor  in  limiting  efficiency  of 
reduction  to  tridiagonal  form  than  communication. 

Back transformation is a one-sided transformation with updates that can be
formed anytime prior to their application. Hence back transformation requires $O(n/nb)$
messages (where nb is the data layout block size) and far less software overhead than
reduction to tridiagonal form.

Chapters  4  and  5  discuss  the  execution  time  of  reduction  to  tridiagonal  form  and 
back  transformation,  as  implemented  in  ScaLAPACK,  in  detail. 
2.3 Implementations

2.3.1  Parallel  abstraction  and  languages 

There  are  three  common  ways  of  expressing  parallelism  in  linear  algebra  codes: 
message  passing,  shared  memory  and  calls  to  the  BLAS.  Message  passing  programs  tend 
to  keep  communication  to  a  minimum,  in  part  because  the  communication  is  specified 
directly.  Shared  memory  codes  can  outperform  message  passing  codes  when  load  imbalance 
costs  outweigh  communication  costs[118].  All  calls  to  the  BLAS  offer  potential  parallelism 
though  the  potential  for  speedup  varies.  ScaLAPACK  uses  message  passing  while  LAPACK 
exposes  parallelism  through  calls  to  the  BLAS. 

In  some  cases,  recent  compilers  are  able  to  identify  the  parallelism  in  codes  that 
may not have been written specifically for parallel execution[172, 171]. However, experience
has shown that programs designed for sequential machines rarely exhibit the properties
necessary for efficient parallel execution, hence some research into parallelizing compilers
has  switched  its  emphasis  to  parallelizing  codes  which  are  written  in  languages  such  as 
HPF[94,  110]  which  allow  the  programmer  to  express  parallelism  and  allow  some  control 
over  data  layout. 

Codes written in any standard sequential language, such as C, C++ or Fortran can
achieve  high  performance,  especially  if  the  majority  of  the  operations  are  performed  within 
calls  to  the  BLAS.  If  the  flops  are  performed  within  codes  written  in  the  language  itself,  the 
execution  time  will  depend  upon  the  code  and  the  compiler  more  than  on  the  language  used. 
If  pointers  are  used  carelessly  in  C,  the  compiler  may  not  be  able  to  determine  the  data 
dependencies  exactly  and  may  have  to  forgo  certain  optimizations[172].  On  the  other  hand, 
carefully  crafted  C  codes,  tuned  for  individual  architectures  and  compiled  with  modern 
optimizing  compilers  can  result  in  performance  that  rivals  that  of  carefully  tuned  assembly 
codes[23,  168]. 

2.3.2  Algorithmic  blocking 

A blocked code is one that has been recast to allow some of the flops to be performed
as efficient BLAS3 matrix-matrix multiply flops[6, 4]. Typically a block of columns is
reduced using an unblocked code followed by a matrix-matrix update of the trailing matrix.
The algorithmic blocking factor is the number of columns (or rows) in the block column.
In serial codes, data layout blocking does not exist and hence the algorithmic blocking factor
is referred to simply as the blocking factor. In ScaLAPACK version 1.5, the algorithmic
blocking factor is set to match the data layout blocking factor.
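
The pattern of an unblocked panel reduction followed by a matrix-matrix update can be sketched with Cholesky factorization, which is simpler than tridiagonal reduction. This is a serial illustration in plain Python, not LAPACK or ScaLAPACK code; the function names are hypothetical, and nb plays the role of the algorithmic blocking factor.

```python
import math


def cholesky_unblocked(L, lo, hi):
    """Factor the diagonal block L[lo:hi][lo:hi] in place (lower triangle)."""
    for j in range(lo, hi):
        L[j][j] = math.sqrt(L[j][j] - sum(L[j][t] ** 2 for t in range(lo, j)))
        for i in range(j + 1, hi):
            s = sum(L[i][t] * L[j][t] for t in range(lo, j))
            L[i][j] = (L[i][j] - s) / L[j][j]


def cholesky_blocked(A, nb):
    """Right-looking blocked Cholesky: returns lower-triangular L with A = L L^T."""
    n = len(A)
    L = [[float(x) for x in row] for row in A]
    for lo in range(0, n, nb):
        hi = min(lo + nb, n)
        cholesky_unblocked(L, lo, hi)           # unblocked code on the block column
        for i in range(hi, n):                  # panel below the diagonal block:
            for j in range(lo, hi):             # solve L21 * L11^T = A21
                s = sum(L[i][t] * L[j][t] for t in range(lo, j))
                L[i][j] = (L[i][j] - s) / L[j][j]
        for i in range(hi, n):                  # matrix-matrix update of the trailing
            for j in range(hi, i + 1):          # matrix: A22 -= L21 * L21^T
                L[i][j] -= sum(L[i][t] * L[j][t] for t in range(lo, hi))
    for i in range(n):                          # clear the unused upper triangle
        for j in range(i + 1, n):
            L[i][j] = 0.0
    return L
```

With nb = 1 the trailing update degenerates to a rank-1 update per column; with larger nb most of the arithmetic moves into the trailing update, which is the part a BLAS3 routine would perform.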

2.3.3  Internal  Data  Layout 

Most of the flops in blocked dense linear codes involve a rank k update, i.e. $A' = A + BC$
where $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{m \times k}$, $C \in \mathbb{R}^{k \times n}$, m and n are $O(n)$ and k is the algorithmic
blocking factor (a tuning parameter typically much smaller than n or m). A may be
triangular and B and/or C may be transposed or conjugate transposed. Hence internal
data layout must support good performance on such rank k updates.

A is typically updated in place, i.e. the node which owns element $A_{ij}$ computes
and stores $A'_{ij}$. This is called the owner computes rule and is motivated by the high cost
of data movement relative to the cost of floating point computation. If k is large enough a
3D data layout is more efficient[1] [12], and performance can be improved further by using
Strassen’s matrix multiply[157] [96] [70]. Some dense linear algebra codes, including LU,
can be recursively partitioned[165] resulting in large values of k for the majority of the
flops. Nonetheless, though a 3D data layout might be best for a recursively partitioned LU,
reduction to tridiagonal form is most efficient with a modest algorithmic blocking factor
and hence it is more efficient to update A in place and we will make that assumption for
the rest of this discussion.

If  A  is  to  be  updated  in  place,  a  2D  layout  minimizes  the  total  communication 
requirement  for  rank  k  updates.  The  elements  of  B  and  C  which  must  be  sent  to  each  node 
are determined by the elements of A owned by that node. The node that owns element $A_{ij}$
must obtain a copy of row i of B and column j of C. The number of elements of matrices
B and C that a given node must obtain is k times the number of rows and columns of A
for which the node owns at least one element. If a node must own $r^2$ elements, the number
of  elements  of  B  and  C  which  must  be  obtained  is  minimized  if  the  node  owns  a  square 
submatrix  of  A  corresponding  to  r  rows  and  r  columns.  In  a  2D  layout,  the  processors  are 
arranged  in  a  rectangular  grid.  Each  row  of  the  matrix  is  assigned  to  a  row  of  the  processor 
grid.  Each  column  is  assigned  to  a  column  of  the  processor  grid. 
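
The argument for a square local submatrix can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and it assumes the node owns a single rectangular piece of A whose side lengths divide $r^2$ exactly.

```python
def elements_to_fetch(rows_owned, cols_owned, k):
    """Entries of B and C needed to update a rows_owned x cols_owned piece of A."""
    return k * (rows_owned + cols_owned)


def best_shape(r, k):
    """Among rectangles holding r*r elements, return the one needing the least data."""
    area = r * r
    shapes = [(a, area // a) for a in range(1, area + 1) if area % a == 0]
    return min(shapes, key=lambda shape: elements_to_fetch(shape[0], shape[1], k))
```

For r = 6 and k = 4, a 6 by 6 piece needs 48 elements of B and C, while a 1 by 36 strip holding the same number of elements of A needs 148.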

The common ways of assigning the rows and columns to the processor grid in a
2D layout are: block, cyclic and block-cyclic. For the following descriptions, we will assume
that we are distributing n rows of A over $p_r$ processor rows. In a cyclic layout, row i is
assigned to processor row $i \bmod p_r$. In a block layout, row i is assigned to processor row
$\lfloor i / \lceil n/p_r \rceil \rfloor$. In a block-cyclic data layout, row i is assigned to processor row $\lfloor i/nb \rfloor \bmod p_r$,
where nb is the data layout block-size. The block-cyclic data layout includes the other two
as special cases.
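
The three row-to-processor maps can be written down directly. This is a minimal sketch assuming zero-based row indices and zero-based processor row numbers; the function names are chosen for the example.

```python
def cyclic_owner(i, pr):
    """Processor row owning row i in a cyclic layout."""
    return i % pr


def block_owner(i, n, pr):
    """Processor row owning row i in a block layout with blocks of ceil(n / pr) rows."""
    rows_per_block = -(-n // pr)  # integer ceiling division
    return i // rows_per_block


def block_cyclic_owner(i, nb, pr):
    """Processor row owning row i in a block-cyclic layout with block size nb."""
    return (i // nb) % pr
```

Setting nb = 1 reproduces the cyclic layout, and setting nb to $\lceil n/p_r \rceil$ reproduces the block layout, which is the sense in which block-cyclic includes the other two.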

Block-cyclic data layouts simplify algorithmic blocking and are used in most parallel
dense linear algebra libraries[68] [98, 164]. However, by separating algorithmic blocking
from data blocking it is usually^1 possible to achieve high performance from a cyclic data
layout[91, 140, 44, 158].

One-dimensional data layouts require $O(n^2)$ data movement per node (compared
to $O(n^2/\sqrt{p})$ for 2D data layouts) and are generally less efficient. However, there are certain
situations in which 1D data layouts are preferred. If the communication pattern is strictly
one-dimensional (i.e. only along rows or columns) a 1D data layout requires no communication.
Furthermore, some applications, such as LU, require much more communication in
one direction than the other^2. Hence, for modest numbers of processors it may be better
to use a 1D data layout.

A  square  processor  grid  can  greatly  simplify  symmetric  reductions  -  allowing  lower 
overhead  codes.  Furthermore,  I  believe  that  pipelining  and  lookahead  (see  section  2.4.2) 
can only be used effectively on symmetric reductions (such as Cholesky and reduction from
generalized to standard form) when a square processor grid is used^3.

All  existing  parallel  dense  linear  algebra  libraries  use  the  same  input  data  layout 
as the internal data layout. In Chapter 8 I will demonstrate that this is not necessary to
achieve  high  performance  and  that  in  fact  performance  can  be  improved  by  using  a  different 
data  layout  internally  than  the  input  and  output  data  layout. 

1 Block-cyclic data layouts still maintain an advantage over cyclic data layouts on machines with high
communication latency, especially in those algorithms, such as Cholesky and back transformation, that
require only $O(n/nb)$ messages, where nb is the data layout block-size.

2 LU with partial pivoting requires $O(n \log(p))$ messages within the processor columns but only $O(n/nb)$
messages within the processor rows[31, 40, 30]. The total volume of communication however is similar in
both directions.

3  Pipelining  and  lookahead  cannot  be  used  in  reduction  to  tridiagonal  form  because  of  its  synchronous 
nature. 
2.3.4 Libraries

Software  libraries  can  improve  portability,  robustness,  performance  and  software 
re-use. ScaLAPACK is built on top of the BLAS and BLACS and hence will run on any system
on  which  a  copy  of  the  BLAS[63,  62]  and  BLACS[169,  69]  can  be  obtained. 

Libraries, and their interfaces, have both a positive and a negative effect on performance.
The existence of a standard interface to the BLAS means that by improving the
performance of a limited set of routines, i.e. the BLAS, one can improve the performance of
the entire LAPACK and ScaLAPACK library and other codes as well. Hence, many manufacturers
have written optimized BLAS for their machines. In addition, Bilmes et al.[23, 168]
have written a portable high performance matrix-matrix multiply and two other research
groups have written high performance BLAS that depend only on the existence of a high
performance matrix-matrix multiply[51, 103, 104]. Portable high performance BLAS offers
the  promise  of  high  performance  on  LAPACK  and  ScaLAPACK  codes  without  the  expense  of 
hand  coded  BLAS. 

However, adhering to a particular library interface necessarily rules out some possibilities.
The BLACS do not support asynchronous receives, a costly limitation on the Paragon.
The BLAS do not meet all computational needs[108], especially in parallel codes[91], hence
the programmer is faced with the choice of reformulating code to use what the BLAS offers
or avoiding the BLAS and trusting the compiler to produce high performance code. Furthermore,
the interface itself implies some overhead, at the very least a subroutine call
but typically much more than that[161]. Strazdins[161] showed that software overhead in
ScaLAPACK  accounts  for  15-20%  of  total  execution  time  even  for  the  largest  problems  that 
fit  in  memory  on  a  Fujitsu  VP1000. 

2.3.5  Compilers 

Compiler  code  generation  is  relatively  unimportant  to  LAPACK  and  ScaLAPACK 
performance,  because  these  codes  are  written  so  that  most  of  the  work  is  done  in  the  calls 
to the BLAS. By contrast, EISPACK is written in Fortran without calls to the BLAS and hence
its  performance  is  dependent  on  the  quality  of  the  code  generated  by  the  Fortran  compiler. 

Lehoucq and Carr[35] argue that compilers now have the capability to perform
many of the optimizations that the LAPACK project performed by hand. Although no compilers
existing today can produce code as efficient as LAPACK from simple three line loops,
the compiler technology exists[149, 115, 148].

Today,  most  compilers  are  able  to  produce  good  code  for  single  loops,  reducing  the 
performance  advantage  of  the  BLAS1  routines.  Soon  compilers  will  be  able  to  produce  good 
code  for  BLAS2  and  even  BLAS3  routines.  This  will  require  us  to  rethink  certain  decisions, 
especially  where  the  precise  functionality  that  we  would  like  is  lacking.  There  will  be  an 
awkward  period,  probably  lasting  decades,  during  which  some  but  not  all  compilers  will  be 
able  to  perform  comparably  to  the  BLAS. 

2.3.6  Operating  Systems 

Operating  systems  are  largely  irrelevant  to  serial  codes  such  as  LAPACK  but  they 
can  have  a  significant  impact  on  parallel  codes.  Consider,  for  example,  the  broadcast 
capability  inherent  in  Ethernet  hardware.  That  capability  is  not  available  because  the 
TCP/IP  protocol  does  not  allow  access  to  that  capability.  Furthermore,  at  least  90%  of 
the  message  latency  cost  is  attributable  to  software  and  the  operating  system  often  makes 
it  difficult  to  reduce  the  message  latency  cost.  Part  of  the  NOW[3]  project  involves  finding 
ways  to  reduce  the  large  message  latency  cost  inherent  in  Unix  operating  systems  through 
using  user-level  to  user-level  communications,  avoiding  the  operating  system  entirely. 

2.4  Hardware 

2.4.1  Processor 

The  processor,  or  more  specifically  the  floating  point  unit,  is  the  fundamental 
source  of  processing  power  or  the  ultimate  limit  on  performance,  depending  on  your  point 
of  view.  The  combined  speed  of  all  of  the  floating  point  units  is  the  peak  performance, 
or  speed  of  light,  for  that  computer.  For  many  dense  linear  algebra  codes,  the  number  of 
floating point operations cannot be reduced substantially and hence the goal is to perform
the  necessary  flops  as  fast  (i.e.  as  close  to  the  peak  performance)  as  possible. 

Floating  point  arithmetic 

The increasing adherence to the IEEE standard 754 for binary floating point
arithmetic[?] benefits performance in two ways: it reduces the effort needed to make codes
work across multiple platforms and it allows one to take advantage of details of the underlying
arithmetic in a portable code. The developers of LAPACK had to expend considerable
effort  to  make  their  codes  work  on  machines  with  non-IEEE  arithmetic,  notably  older  Cray 
machines.  By  contrast,  the  developers  of  ScaLAPACK  chose  to  concentrate  on  IEEE  standard 
754  conforming  machines  allowing  them  not  only  to  avoid  the  hassles  of  old  Cray  arithmetic, 
but  also  to  check  the  sign  bit  directly  when  using  bisection[54]  to  compute  the  eigenvalues 
of  a  tridiagonal  matrix. 
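
As an illustration of how bisection can lean on IEEE 754 behavior, the following sketch counts eigenvalues below a shift by inspecting the sign bit of each pivot, letting a zero pivot produce an infinity rather than stopping the recurrence. It is a serial illustration, not the ScaLAPACK code. Python raises an exception on division by zero, so the helper ieee_div emulates the IEEE result; that helper, the function names, and the Gershgorin starting interval are assumptions of this sketch, and the degenerate 0/0 case (which yields NaN) is not handled specially.

```python
import math


def ieee_div(num, den):
    """num / den with IEEE 754 results for den == +-0 (plain Python raises instead)."""
    if den == 0.0:
        if num == 0.0:
            return math.nan
        return math.copysign(math.inf, num) * math.copysign(1.0, den)
    return num / den


def sign_bit(x):
    """True when the sign bit of x is set (this also distinguishes -0.0 from 0.0)."""
    return math.copysign(1.0, x) < 0.0


def count_below(d, e, sigma):
    """Number of eigenvalues below sigma for diagonal d and off-diagonal e."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        off = ieee_div(e[i - 1] ** 2, q) if i > 0 else 0.0
        q = d[i] - sigma - off  # pivot of the LDL^T factorization of T - sigma*I
        count += sign_bit(q)
    return count


def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0, 1, ...) by bisection from Gershgorin bounds."""
    n = len(d)
    radius = [(abs(e[i - 1]) if i > 0 else 0.0) + (abs(e[i]) if i < n - 1 else 0.0)
              for i in range(n)]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) > k:
            hi = mid  # at least k+1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Because each eigenvalue is located independently by its index k, different processes can compute disjoint index ranges without coordinating with one another.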

Consistent floating point arithmetic is also important for execution on heterogeneous
machines. Demmel et al.[54] discuss ways to achieve correct results in bisection on a
heterogeneous machine. I have proposed having each process compute a subset of eigenvalues,
chosen by index, sharing those eigenvalues among all processes and then having each
process independently sort the eigenvalues[55].

Ironically the one place where the IEEE standard 754 allows some flexibility has
caused problems for heterogeneous machines. The IEEE standard 754 allows several options
for handling sub-normalized numbers, i.e. numbers that are too small to be represented as
a normalized number. During ScaLAPACK testing it was discovered that a sub-normalized
number could be produced on a machine that adheres to the IEEE standard 754 completely
and that when this number is then passed to the DEC Alpha 21064 processor, the DEC
Alpha 21064 processor does not recognize it as a legitimate number and aborts. To fix
this would have required either that xdr be smart enough to recognize this unusual situation^4 or
that one of the processors work in a manner different from its default^5.

2.4.2  Memory 

The  slower  speed  of  main  memory  (as  compared  to  cache  or  registers)  affects 
performance  in  three  ways.  It  reduces  the  performance  of  matrix-matrix  multiply  slightly 
and  greatly  complicates  the  task  of  coding  an  efficient  matrix-matrix  multiply.  It  bounds 
from  below  the  algorithmic  blocking  factor  needed  to  achieve  high  performance  on  matrix- 
matrix  multiply.  And,  it  limits  the  performance  of  BLAS1  and  BLAS2  codes. 

The last two factors listed above combine in an unfortunate manner: slow main
memory increases the number of BLAS1 and BLAS2 flops and reduces the rate at which they
are executed. The number of BLAS1 and BLAS2 flops is typically $O(n^2\, nb)$, where nb is the

4 This  would  slow  down  xdr,  possibly  significantly. 

5 This too would result in slower execution.
algorithmic blocking factor, which as stated above, must be larger when main memory is
slow. The ratio of peak floating point performance to main memory speed is large enough
on some machines that the $O(n^2\, nb)$ cost of the BLAS1 and BLAS2 flops can no longer be
ignored.

Improving the load balance of the $O(n^2\, nb)$ BLAS1 and BLAS2 flops.

In a blocked dense linear algebra transformation, such as LU decomposition,
Cholesky or QR, there are $O(n^2\, nb)$ BLAS1 and BLAS2 flops[30, 53]. PDSYEVX includes two
blocked  dense  linear  algebra  transformations:  Reduction  to  tridiagonal  form,  PDSYTRD,  is 
described  in  Section  4.2  and  back  transformation,  PDORMTR,  is  described  in  Section  4.4. 

In ScaLAPACK version 1.5, the $O(n^2\, nb)$ BLAS1 and BLAS2 flops are performed by
just one row or column of processors. This leads to load imbalance and causes these flops
to account for $O(n^2\, nb / \sqrt{p})$ execution time. If these flops can be performed on all p processors,
instead of just one row or column, they will account for only $O(n^2\, nb / p)$ execution time.

There are two ways to spread the cost of the $O(n^2\, nb)$ BLAS1 and BLAS2 flops
over all the processors: take them out of the critical path or distribute them over all
processors. Transformations such as LU, and back transformation (applying a series of
Householder vectors) can be pipelined, allowing each processor column (or row) to execute
asynchronously.  Pipelining  in  turn  allows  lookahead,  a  process  by  which  the  active  column 
performs  only  those  computations  in  the  critical  path  before  sending  that  data  on  to  the 
next  column[32]. 

Distributing the BLAS1 and BLAS2 flops over all of the processors, as discussed in
the last paragraph, requires a different data distribution, a different broadcast and a significant
change to the code. The difference can be best illustrated by considering LU. In a
2D blocked LU, LU is first performed on a block of columns, and the resulting LU decomposition
is broadcast, or spread across, to all processor columns. One way to broadcast k
elements to p processors is to combine a Reduce_scatter (which takes k elements and sends
k/p to each processor) with an Allgather (which takes k/p elements from each processor and
spreads them out to all processors giving each processor a copy of all k elements). There
are three ways to perform LU on this column block of data: 1) Before the column block is
broadcast to all processors (as ScaLAPACK does), in which case only the current column of
processors is involved in performing the column LU and the Reduce_scatter and Allgather
combine to broadcast the block LU decomposition. 2) After the broadcast, in which case
the Reduce_scatter and Allgather combine to broadcast the block column prior to the LU
decomposition - all processor columns would have a copy of the block column and each processor
column could perform the column block LU redundantly. 3) After the Reduce_scatter
but before the Allgather. In this case, the Reduce_scatter operates on the column block
prior to the LU decomposition but the Allgather operates on the block column after the
LU decomposition. All processors can be involved in the LU decomposition.
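
The third option can be mimicked with a small single-process simulation. This is a sketch only: no message passing library is used, the p processors are simulated by list entries, and the function names are hypothetical. The first phase is written as a plain scatter of the k elements, matching the description of the Reduce_scatter step above, and work is an element-wise stand-in; a real column block LU has dependencies between elements that this toy ignores.

```python
def scatter(data, p):
    """Split data into p contiguous chunks of about k/p elements, one per processor."""
    chunk = -(-len(data) // p)  # integer ceiling division
    return [data[r * chunk:(r + 1) * chunk] for r in range(p)]


def allgather(chunks):
    """Give every simulated processor a copy of all chunks, concatenated in order."""
    full = [x for chunk in chunks for x in chunk]
    return [list(full) for _ in chunks]


def broadcast_with_work(data, p, work):
    """Scatter, let each processor work on its own share, then allgather the results."""
    chunks = scatter(data, p)
    chunks = [[work(x) for x in chunk] for chunk in chunks]  # all p processors busy
    return allgather(chunks)
```

Doing the work before the scatter corresponds to option 1, where one set of processors does everything, and doing it after the allgather corresponds to option 2, where every processor repeats the same work.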

In  HJS,  Hendrickson,  Jessup  and  Smith’s  symmetric  eigensolver[91,  154]  discussed 
in  Section  7.1.2,  the  BLAS1  and  BLAS2  flops  are  analogously  distributed  over  all  of  the 
processors. 

Lookahead  does  not  improve  performance  unless  the  execution  of  the  code  is 
pipelined,  i.e.  proceeds  in  a  wave  pattern  over  the  processes.  Two-sided  reductions,  like 
tridiagonal  reduction,  do  not  allow  pipelining.  And,  pipelining  may  be  limited  on  reductions 
of symmetric or Hermitian matrices (such as Cholesky)^6.

Memory  size 

The  amount  of  main  memory  limits  the  size  of  the  problem  that  can  be  executed 
efficiently,  while  the  amount  of  virtual  memory  limits  the  size  of  the  problem  that  can  be 
run at all. ScaLAPACK’s symmetric eigensolvers, PDSYEVX and PDSYEV, require roughly $4n^2$
and $2n^2$ double precision words of virtual memory space respectively. However, both can
be run efficiently provided that physical memory can contain^7 the $n^2/2$ elements of the
triangular  matrix  A.  Ed  D’Azevedo[52]  has  written  an  out-of-core  symmetric  eigensolver 
for  ScaLAPACK  and  studied  the  performance  of  PDSYEV  and  PDSYEVX  on  large  problem  sizes. 
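
These memory requirements translate into bytes as follows. The sketch assumes 8-byte double precision words and reports totals summed over all processes; the function name and dictionary keys are hypothetical.

```python
def memory_estimate_bytes(n, bytes_per_word=8):
    """Rough total memory figures for an n by n problem, from the counts quoted above."""
    return {
        "PDSYEVX virtual": 4 * n * n * bytes_per_word,
        "PDSYEV virtual": 2 * n * n * bytes_per_word,
        "triangular A physical": n * n * bytes_per_word // 2,
    }
```

For n = 10000 this gives 3.2 GB of virtual memory for PDSYEVX, 1.6 GB for PDSYEV, and 0.4 GB of physical memory to hold the triangular matrix A.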

2.4.3  Parallel  computer  configuration 

I  will  discuss  primarily  distributed  memory  computers  with  one  processor  per 
node,  discussing  shared  memory  computers  (SMPs),  clusters  of  workstations  and  clusters 
of  shared  memory  computers  only  briefly. 

Four machine characteristics are important for distributed memory computers:
peak floating point performance, software overhead, communication latency and communication
(bisection) bandwidth. Software overhead and communication latency are the
dominant costs for small problems^8. Peak floating point performance is the dominant cost
for large problems^9.

6 I believe that pipelining can be used in Cholesky if a square processor grid is used. Work in progress.

7 Depending on the page size, keeping an n by n triangular matrix in memory may require as few as $n^2/2$
memory locations (if the page size is 1) or as many as $n^2$ (if the page size is > n).

Interconnection  network 

Bisection  bandwidth  and  communication  latency  are  the  two  important  measures 
of  an  interconnection  network.  Networks  which  allow  only  one  pair  of  nodes  to  communicate 
at a time do not offer adequate bisection bandwidth and hence parallel dense linear algebra
(with  the  possible  exception  of  huge  matrix-matrix  multiplies)  will  not  perform  well  on  such 
a  network. 

As  long  as  the  bisection  bandwidth  is  adequate,  the  topology  of  the  interconnection 
network  has  not  proven  to  be  an  important  factor  in  the  performance  of  parallel  dense  linear 
algebra. 

Shared  Memory  Multiprocessing 

Users of dense linear algebra codes have two choices on shared memory multiprocessors.
They can use a serial code, such as LAPACK that has been coded in terms of
the  BLAS  and,  provided  that  the  manufacturer  has  provided  an  optimized  BLAS,  they  will 
achieve  good  performance.  Or,  provided  that  the  manufacturer  provides  MPI[65],  PVM[19] 
or  the  BLACS  they  can  use  ScaLAPACK. 

LeBlanc  and  Markatos[118]  argue  that  shared  memory  codes  typically  get  better 
load balance while message passing codes typically incur lower communications cost. However,
the real difference could well come down to a matter of how efficient the underlying
libraries  are. 

Clusters  of  workstations 

Some clusters of workstations, notably the NOW project[3] at Berkeley, offer comparable
communication performance to distributed memory computers. However, the vast
majority  of  networks  of  workstations  in  present  use  are  still  connected  by  Ethernet  or  FDDI 

8 On current architectures, $n < 100\sqrt{p}$ is small for our purposes.

9 On current architectures, $n > 1000\sqrt{p}$ is large for our purposes.
rings and hence do not have the low latency and high bisection bandwidth required to perform
dense linear algebra reductions in parallel efficiently.

Cluster  of  SMPs  (CLUMPS) 

Dense  linear  algebra  codes  have  two  choices  on  clusters  of  SMPs:  they  can  assign 
one  process  to  each  processor  or  they  can  assign  one  process  to  each  multi-processor  node. 
The  tradeoff  will  be  similar  to  the  shared-memory  versus  message-passing  question  on  shared 
memory  computers. 

If each processor is assigned a separate process, the details of how the processes
will be assigned to what is essentially a two level grid of processors will be important.
For a modest cluster of SMPs (say 4 nodes each with 4 processors) it might make sense
to assign one dimension within the node and the other across the nodes. However, this
will not scale well: adding nodes will require increasing the bandwidth per node, or else all
dense linear algebra transformations will become bandwidth limited as the number of nodes
increases. A layout that is 2 dimensional within the nodes and 2 dimensional among
the nodes allows both the number of processors per node and the number of nodes to
increase, provided only that bisection bandwidth grows with the number of processors and
that internal bisection bandwidth (i.e. main memory bandwidth) grows with the number
of processors per node.

On the first CLUMPS, how well each of the libraries is implemented is likely to
outweigh theoretical considerations. Shared memory BLAS are not trivial, nor will
communication systems that properly handle two levels of processor hierarchy (i.e.
communication within a node and communication between nodes) be.

On most distributed memory systems, the logical to physical processor grid mapping
is of secondary importance. I suspect that this will not be the case for clusters of
SMPs.  It  will  be  important  to  have  the  processes  assigned  to  the  processors  on  a  particular 
node  nearby  in  the  logical  process  grid  as  well. 

2.5  Applications 

Large symmetric eigenproblems are used in a variety of applications. Some of
these applications include: real-time signal processing[156] [34], modeling of acoustic and
electro-magnetic waveguides[114], quantum chemistry[74] [22, 175], numerical simulations
of disordered electronic systems[95], vibration mode superposition analysis[18], statistical
mechanics[132], molecular dynamics[152], quantum Hall systems[112, 106], material
science[166], and biophysics[143, 144].

The needs of these applications differ considerably. Many require considerable
execution time to build the matrix and hence the eigensolution remains a modest part of the
total execution. However, building the matrix often parallelizes easily and grows much more
slowly than the O(n^3) cost of eigensolution. Hence, for these applications, the eigensolver
becomes the bottleneck as larger problems are solved in parallel. Few applications require
the entire spectrum, but most of those listed above require at least 10% of the spectrum
and hence are best solved by dense techniques. Some have large clusters of eigenvalues[74],
while others do not.

2.5.1  Input  matrix 

Three features of the input matrix affect the execution time of symmetric eigensolvers:
sparsity, eigenvalue clustering and spectral diagonal dominance.

Sparsity 

Some  algorithms  and  codes  are  specifically  designed  for  sparse  input  matrices. 
Lanczos[49]  has  traditionally  been  used  to  find  a  few  eigenvalues  and  eigenvectors  at  the 
ends of the spectrum. Recently, ARPACK[119] and PARPACK[130] have been developed
based  on  Lanczos  with  full  re-orthogonalization.  They  can  therefore  compute  as  much  of 
the  spectrum  as  the  user  chooses. 

The  Invariant  Subspace  Decomposition  Approach  and  reduction  to  tridiagonal 
form  based  algorithms  can  both  be  run  from  either  a  dense  or  banded  matrix.  In  this 
dissertation,  I  discuss  only  dense  matrices. 

Spectrum 

Some  algorithms  are  more  dependent  on  the  spectrum  than  others.  Most  are 
dependent  in  some  manner,  but  that  dependence  differs  from  one  algorithm  to  another. 

It is difficult to maintain orthogonality of the eigenvectors when computing the
eigendecomposition of matrices with tight clusters of eigenvalues. Such matrices require
special techniques in divide and conquer and in inverse iteration (see Section 2.7.4). On
the other hand, divide and conquer experiences the most deflation, and hence the greatest
efficiency, on matrices with clustered eigenvalues.

The Invariant Subspace Decomposition Approach maintains orthogonality on matrices
with clustered eigenvalues. However, it may have difficulty picking a good split point
if  the  clustering  causes  the  eigenvalues  to  be  unevenly  distributed. 

Spectral  Diagonal  dominance 

Spectral diagonal dominance (footnote 10) speeds convergence of the Jacobi algorithm.
Indeed, if the input matrix is sufficiently diagonally dominant, Jacobi may converge in as
little as two steps (versus 10 to 20 for non diagonally dominant matrices). But spectral
diagonal dominance has little effect on any of the other algorithms.

2.5.2  User  request 

The  portion  of  the  spectrum  that  the  user  needs,  i.e.  the  number  of  eigenvalues 
and/or  eigenvectors,  affects  execution  time  of  some,  but  not  all  eigensolvers. 

Two step band reduction (to tridiagonal form) is most attractive when only eigenvalues
are requested because the back transformation task is expensive in two step band
reduction.

The cost of bisection and inverse iteration depends upon the number of eigenvalues
and eigenvectors requested. These costs are O(n^2) and generally not significant for large
problem sizes. However, back transformation requires 2n^2 m flops where m is the number
of eigenvectors required.

Iterative  methods,  such  as  Lanczos[49]  and  implicitly  restarted  Lanczos[119]  are 
clearly  superior  if  only  a  few  eigenvectors  are  required. 

10 Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is
diagonally dominant. Most, but not all, diagonally dominant matrices are spectrally diagonally dominant. For
example, if you take a dense matrix with elements randomly chosen from [-1, 1] and scale the diagonal
elements by 1e3, the resulting diagonally dominant matrix will be spectrally diagonally dominant. However,
if you take that same matrix and add 1e3 to each diagonal element, the eigenvector matrix is unchanged
even though the matrix is clearly diagonally dominant.

2.5.3 Accuracy and Orthogonality requirements

Demmel and Veselić[58] prove that on scaled diagonally dominant matrices (footnote 11),
Jacobi can compute small eigenvalues with high relative accuracy while tridiagonal based
methods can fail to do so.

At present, ScaLAPACK offers two symmetric eigensolvers: PDSYEVX and PDSYEV.
PDSYEVX, which is based on bisection and inverse iteration (DSTEBZ and DSTEIN from LAPACK),
is faster and scales better but does not guarantee orthogonality among eigenvectors associated
with clustered eigenvalues. PDSYEV, which is based on QR iteration (DSTEQR from
LAPACK), is slower and does not scale as well but does guarantee orthogonal eigenvectors.

2.5.4  Input  and  Output  Data  layout 

At  present,  the  execution  time  of  the  ScaLAPACK  symmetric  eigensolver  is  strongly 
dependent on the data layout chosen by the user for input and output matrices. 1D data
layouts are not scalable and lead to both high communication costs and poor load balancing.
Suboptimal  block  sizes  can  likewise  affect  performance  significantly.  In  particular,  a  block 
size  of  1,  i.e.  cyclic  data  layout,  causes  ScaLAPACK  to  send  a  large  number  of  small  messages 
resulting  in  unacceptable  message  latency  costs  and  a  huge  number  of  calls  to  the  BLAS.  If 
the  block  size  is  too  large,  load  balance  suffers. 
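
The trade-off between block size, message count and load balance can be illustrated with a minimal sketch of a one-dimensional block-cyclic mapping. The function names below are hypothetical, the owner formula assumes that the first block lives on process 0, and the number of blocks is used only as a rough proxy for the number of messages; this is not ScaLAPACK code.

```python
from collections import Counter


def owner(global_index, block_size, num_procs):
    """Process owning a 0-based global index in a 1D block-cyclic layout.

    Assumes the first block is assigned to process 0.
    """
    return (global_index // block_size) % num_procs


def load_per_process(n, block_size, num_procs):
    """Number of the indices 0..n-1 owned by each process."""
    counts = Counter(owner(i, block_size, num_procs) for i in range(n))
    return [counts.get(p, 0) for p in range(num_procs)]


def num_blocks(n, block_size):
    """Number of blocks the n indices are cut into (ceiling division)."""
    return -(-n // block_size)


if __name__ == "__main__":
    n, p = 1000, 4
    for nb in (1, 32, 400):
        # Small nb: perfect balance but many blocks; large nb: few blocks, poor balance.
        print(nb, load_per_process(n, nb, p), num_blocks(n, nb))
```

With n = 1000 and 4 processes, a block size of 1 balances the load perfectly but cuts the data into 1000 blocks, while a block size of 400 leaves one process with no data at all.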

There are a couple of ways to reduce this dependence on the data layout chosen by
the user. If algorithmic blocking is separated from data layout blocking[140] [91] [159],
data layouts with small block sizes can be handled much more efficiently. However, small
block sizes (especially cyclic layouts) still require more messages than larger block sizes.
And large block sizes still lead to load imbalance.

In  Chapter  8  I  will  show  that  redistributing  the  data  to  an  internal  format  that 
is  near  optimal  for  the  particular  machine  and  algorithm  involved  allows  for  improved 
performance  and  performance  that  is  independent  of  the  input  and  output  data  layout. 

2.6  Machine  Load 

The load of the machine, in addition to the direct effect of offering your program
only a portion of the total cycles, can have several indirect effects. If each processor is
individually scheduled, performance can be arbitrarily poor because significant progress is
only possible when all processes are concurrently scheduled. A loaded machine may also
cause your data to be swapped out to disk, which can greatly reduce peak performance.
Finally, it is the most heavily loaded machine which controls execution time. If your code
is running on 9 unloaded processors and one processor with a load factor of 5, you will
get no more than a factor of 10/5 speedup. A ScaLAPACK user has reported performance
degradation and speedup less than 1 (i.e. more processors take longer to complete the same
sized eigendecomposition) on the IBM SP2. I have also witnessed this behaviour on
the IBM SP2 at the University of Tennessee at Knoxville, and I have reason to suspect
that the IBM SP2 is not gang scheduled and that this fact accounts for a large part of
the poor performance of PDSYEVX that the user and I have witnessed on the IBM SP2.

11 A matrix, A, is scaled diagonally dominant if and only if DAD with D = |diag(A)|^{-1/2} is diagonally
dominant.
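
The bound quoted above for one heavily loaded processor can be written as a small sketch. It assumes a simple model that is not spelled out in the text: a process on a processor with load factor f receives 1/f of the cycles, and all processes advance at the pace of the slowest one. The function name is hypothetical.

```python
def speedup_bound(load_factors):
    """Upper bound on speedup when every process waits for the slowest one.

    load_factors[i] is the load factor of processor i (1 means unloaded).
    Under the assumed model the bound is (number of processors) / (largest load factor).
    """
    if not load_factors or min(load_factors) < 1:
        raise ValueError("need at least one processor, each with load factor >= 1")
    return len(load_factors) / max(load_factors)


if __name__ == "__main__":
    # 9 unloaded processors and one with a load factor of 5.
    print(speedup_bound([1] * 9 + [5]))
```

For 9 unloaded processors and one processor with a load factor of 5 this gives 10/5 = 2, matching the figure in the text.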

Space  sharing,  allocating  subsets  of  the  processors,  solves  all  of  these  problems, 
but  has  its  own  problems.  On  some  machines,  jobs  running  on  different  partitions  share 
the same communications paths and hence if one job saturates the network, all jobs may
suffer. 


2.7  Historical  notes 

2.7.1  Reduction  to  tridiagonal  form  and  back  transformation 

Householder reduction to tridiagonal form is a two-sided reduction, which requires
multiplication by Householder reflectors from both the left and right side. Martin et al.
implemented reduction to tridiagonal form in Algol[129]. TRED1 and TRED2 perform reduction
to tridiagonal form in EISPACK[153]. Dongarra, Hammarling and Sorensen[64] showed that
Householder reduction to tridiagonal form can be performed using half matrix-vector and
half matrix-matrix multiply flops. This has been implemented as DSYTRD in LAPACK[5, 67]
for scalar and shared memory multiprocessors and PDSYTRD for distributed memory
computers in ScaLAPACK[42]. Chang et al. implemented one of the first parallel codes for
reduction to tridiagonal form, first using a 1D cyclic data layout[37] and then a 2D cyclic
data layout[38].

Smith, Hendrickson and Jessup[91] show that data blocking is not required for
efficient algorithmic blocking and that PDSYTRD pays a substantial execution time penalty
for its generality (accepting any processor layout) and portability (being built on top of
the PBLAS, BLACS and BLAS). By restricting their attention to square processor layouts on
the PARAGON, they were able to dramatically reduce the overhead incurred in reduction to
tridiagonal form in their code, HJS. HJS does not have the redundant communication found in PDSYEVX,
it makes many fewer BLAS calls, avoids the overhead of the PBLAS calls, and spreads the
work more evenly among all the processors (improving load balance). Furthermore, HJS,
by using communication primitives better suited to the task, reduces both the number of
messages sent and the total volume of communication substantially. Some, but not all,
of these advantages necessitate that the processor layout be square. HJS is discussed in
Section 7.1.2.

Other ways to reduce the execution time of reduction to tridiagonal form do not
require that the processor layout be square. Bischof and Sun[25] and Lang[116] showed that
in a two step band reduction to tridiagonal form, all of the flops, asymptotically, can be
performed in matrix multiply routines. Karp, Sahay, Santos and Schauser[107] showed that
subset broadcasts and reductions can be performed optimally. Van de Geijn and others[16]
are working to implement improved subset broadcast and reduction primitives.

Hegland et al.[90] argue that the fastest way to reduce a symmetric matrix A to
tridiagonal form on the VPP500 (a multiprocessor vector supercomputer by Fujitsu) is to
compute L_1 D L_1^T = A and then compute a series of L_i using orthonormal transformations
such that L_{n+p-1} D L_{n+p-1}^T is tridiagonal. Their technique is, in essence, a two step band
reduction in which the two steps are performed within the same loop. Let L_i[:, own(p)]
represent the columns of L_i owned by processor p. pQ_i means the portion of Q_i which
processor p owns.

The code is:

    L_1 D L_1^T = A
    For i = 1 to n - 1 do:
        Each processor independently performs:
            pQ_i = House(L_i[:, own(p)] D[own(p), own(p)] L_i[:, own(p)]^T)
            L_{i+1}[:, own(p)] = pQ_i L_i[:, own(p)]
        The processors together perform:
            Allgather(L_{i+1}[:, i+1 : i+p])
        Each processor performs redundantly:
            Q_i = House(L_{i+1}[:, i+1 : i+p] D[i+1 : i+p, i+1 : i+p] L_{i+1}[:, i+1 : i+p]^T)
            L_{i+1}[:, i+1 : i+p] = Q_i L_{i+1}[:, i+1 : i+p]

In Allgather(L_{i+1}[:, i+1 : i+p]) each processor contributes the column of
L_{i+1}[:, i+1 : i+p] which it owns and all processors end up with identical copies of
L_{i+1}[:, i+1 : i+p].

The loop invariants are as follows:

Let: T_i = (L_i) D (L_i)^T

For all j < i and k > j + p: T_i(j, k) = 0 (Line 1)

T_i(1 : i - p, 1 : i - p) is tridiagonal (Line 2)

For p = 1, the serial case, both of these conditions are identical and meeting them
requires computing the first column of (L_i) D (L_i)^T, computing the Householder vector and
applying it to L_i to yield L_{i+1}.

For p > 1, the parallel case, the first loop invariant is maintained by each processor
independently computing the first column of (L_i) D (L_i)^T, using only the local columns (footnote 12)
of L_i. A Householder vector is computed from this and applied to the local columns of
L_i. The second loop invariant is maintained redundantly on all processors. All processors
obtain copies of columns i to i + p - 1 of L_i and compute:
A(1 : p, 1) = L_i(i : i+p-1, i : i+p-1) D(i : i+p-1, i : i+p-1) L_i(i : i+p-1, i)^T.
A Householder vector is computed from A(1 : p, 1) and applied to L_i(i : i+p-1, :),
redundantly on all processors, maintaining the second loop invariant.

This one-sided transformation requires fewer messages than Householder reduction
to tridiagonal form and, for small p, less message volume, but requires twice as many flops.

2.7.2 Tridiagonal eigendecomposition

Sequential symmetric QL and QR algorithms

The implicit QL or QR algorithms have been the most commonly used methods for
solving the symmetric eigenproblem for the last couple of decades. Francis[79] wrote the first
implementation of the QL algorithm based on Rutishauser's LR transformation. The QL
algorithm is the basis of the EISPACK routine IMTQL1, while the LAPACK routine DSTEQR uses
either implicit QR or implicit QL depending on the top and bottom diagonal elements[86].

12 Their implementation uses a column cyclic data distribution.

Henry[93] shows that if between each sweep of QR (or QL) in which the eigenvectors are
updated an additional sweep is performed in which the eigenvectors are not updated, better
shifts can be used, reducing the total number of flops from roughly 6n^3 to 4n^3.

Reinsch[145] wrote EISPACK's TQLRAT which computes eigenvalues without square
roots. LAPACK's DSTERF improves on TQLRAT using a root free variant developed by Pal,
Walker and Kahan[134]. Like DSTEQR, DSTERF uses either implicit QR or implicit QL
depending on the top and bottom diagonal elements.

Parallel  symmetric  QL  and  QR  algorithms 

QR requires O(n^2) effort to compute the eigenvalues and O(n^3) to compute the
eigenvectors. No one has found a good, stable way to parallelize the O(n^2) cost of computing
the eigenvalues and reflectors. Sameh and Kuck[113] use parallel prefix to parallelize QR
for eigenvalue extraction. They obtain O( ) speedup, but they do not show how their
method can be used to generate reflectors and hence eigenvectors.

However, parallelizing the O(n^3) effort of computing the eigenvectors is straightforward,
as shown by Chinchalkar and Coleman[39] and Arbenz et al.[8], and implemented
for ScaLAPACK by Fellers[76].

Symmetric QR parallelizes nicely in a MIMD programming style, but efforts to
parallelize it on a shared memory machine in which the parallelism is strictly within the
calls to the BLAS have produced only modest speedups. Bai and Demmel[13] first suggested
using multiple shifts in non-symmetric QR. Arbenz and Oettli[10] showed that blocking
and multiple shifts could be used to obtain modest improvements in the speed (roughly a
factor of 2 on 8 processors) of QR for eigenvalues and eigenvectors on the ALLIANT FX/80.
Kaufman[109] showed that multi-shift QR could be used to speed eigenvalue extraction by
a factor of 3 on a 2-processor Cray YMP despite tripling the number of flops performed.

Sturm  sequence  methods 

Givens[83] used bisection to compute the eigenvalues of a tridiagonal matrix based
on Wilkinson's original idea. Kahan[105] showed that bisection can compute small eigenvalues
with tiny componentwise relative backward error, and sometimes high relative accuracy.
High relative accuracy is required for inverse iteration on a few matrices. Barlow and Evans
were the first to use bisection in a parallel code[15].

Computing eigenvalues of a tridiagonal matrix can be split into three phases:
isolation, separation and extraction. The isolation phase identifies, for each eigenvalue, an
interval which contains that eigenvalue and no other. The separation phase improves the
eigenvalue estimate. And the extraction phase computes the eigenvalue to within some
tolerance. Bisection can be used for all three phases.
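
As an illustration of how bisection can serve all three phases, the following is a minimal serial sketch of a Sturm sequence count and a bisection loop built on it. The function names, the Gershgorin starting interval and the tiny pivot nudge are choices made for this sketch; they are not taken from DSTEBZ or any other code discussed here.

```python
def sturm_count(d, e, x):
    """Number of eigenvalues below x of the symmetric tridiagonal matrix
    with diagonal d (length n) and off-diagonal e (length n - 1)."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        off = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - x - off / q
        if q == 0.0:
            q = -1e-300  # nudge a zero pivot so the next division is defined
        if q < 0.0:
            count += 1
    return count


def gershgorin_bounds(d, e):
    """Interval guaranteed to contain every eigenvalue."""
    n = len(d)
    radius = [(abs(e[i - 1]) if i > 0 else 0.0) + (abs(e[i]) if i < n - 1 else 0.0)
              for i in range(n)]
    return (min(d[i] - radius[i] for i in range(n)),
            max(d[i] + radius[i] for i in range(n)))


def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0, 1, ...) by bisection on the Sturm count."""
    lo, hi = gershgorin_bounds(d, e)
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) > k:
            hi = mid  # more than k eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    d, e = [2.0, 2.0, 2.0], [-1.0, -1.0]
    print([kth_eigenvalue(d, e, k) for k in range(3)])
```

The divide and the comparison inside the loop of `sturm_count` are the inner-loop costs that the later discussion of the inertia computation refers to.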

Neither existing codes nor the literature explicitly distinguish between these three
phases, but they have very different computational aspects. Isolation, at least to the point
of identifying p intervals so that each processor is responsible for one interval, is difficult
to parallelize, whereas the other phases are fairly straightforward. The separation phase is
typically  the  challenge  for  most  root  finders,  and  the  area  where  they  distinguish  themselves 
from  other  codes.  Divide  and  conquer  techniques  which  use  the  eigenvalues  from  perturbed 
matrices  as  estimates  of  the  eigenvalues  of  the  original  matrix,  isolate  and  may  separate  the 
roots. 

Techniques for eigenvalue isolation include: multi-section[126] [14], assigning different
parts of the spectrum to different processors[95, 20], divide and conquer and using
multiple processors to compute the inertia of a tridiagonal matrix[123]. In multi-section,
each processor computes the inertia at a single point, splitting an interval into p + 1
intervals. Although multi-section requires communication, Crivelli and Jessup[48] show that
the communication cost is often a modest part of the total cost. Divide and conquer splits
the matrix by perturbing or ignoring a couple of elements, typically near the center of the
matrix to separate the matrix into two tridiagonal matrices whose eigenvalues can be
computed separately. If a rank 1 perturbation is chosen, the merged set of eigenvalues provides
a set of intervals in which exactly one eigenvalue lies.
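
One step of multi-section can be sketched serially as follows. Each of the p evaluation points stands in for the work of one processor, and the equally spaced points and function names are assumptions made for this sketch rather than details of the cited codes.

```python
def inertia_count(d, e, x):
    """Number of eigenvalues below x of the symmetric tridiagonal matrix
    with diagonal d and off-diagonal e (a plain Sturm sequence count)."""
    count, q = 0, 1.0
    for i in range(len(d)):
        off = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - x - off / q
        if q == 0.0:
            q = -1e-300  # nudge a zero pivot so the next division is defined
        if q < 0.0:
            count += 1
    return count


def multisection(d, e, a, b, p):
    """Split [a, b] at p equally spaced points into p + 1 subintervals.

    Returns a list of (left, right, number_of_eigenvalues) triples.
    In a parallel code each of the p interior counts would be computed
    by a different processor and then exchanged.
    """
    points = [a + (b - a) * j / (p + 1) for j in range(1, p + 1)]
    edges = [a] + points + [b]
    below = [inertia_count(d, e, x) for x in edges]
    return [(edges[j], edges[j + 1], below[j + 1] - below[j]) for j in range(p + 1)]


if __name__ == "__main__":
    for left, right, count in multisection([2.0, 2.0, 2.0], [-1.0, -1.0], 0.0, 4.0, 4):
        print(left, right, count)
```

Subintervals that end up holding exactly one eigenvalue are isolated; those holding more would be split again.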

There are a number of ways to use multiple processors to compute the inertia of
a tridiagonal matrix. Lu and Qiao[127] discuss using parallel prefix to compute the Sturm
sequence as the sub-products of a series of 2 by 2 matrices and Mathias[131] did an error
analysis and showed that it was unstable. Ren[146] tried unsuccessfully to repair parallel
prefix. Conroy and Podrazik[46] perform LU on a block arrowhead matrix. Each block is
tridiagonal and the arrow has width equal to the number of blocks. Swarztrauber[162] and
Krishnakumar and Morf[111] discuss ways of computing the determinant of 4 matrices of
size roughly n by n from the determinants of 8 matrices of size roughly n/2 by n/2. Each
of these methods performs 2 to 4 times more floating point operations than a serial Sturm
sequence count would and requires O(log(p)) messages. Except for Conroy and Podrazik's
method, they all use multiplies instead of divides. Multiplies are faster than divides, but
require special checks to avoid overflow.

The computation of the inertia is slowed by the existence of a divide and a comparison
in the inner loop. There are a couple of tricks that can potentially be used to
speed computation of the inertia by reducing the number of divides and comparisons or by
making them faster. ScaLAPACK's PDSYEVX uses signed zeroes and the C language ability to
extract the sign bit of a floating point number to avoid a comparison in the inner loop[54].
I have proposed perturbing tiny entries in the tridiagonal matrix to guarantee that negative
zero will never occur, thus allowing a standard C or Fortran comparison against zero. Using
a standard comparison against zero would allow compilers to produce more efficient code.
I have also proposed reducing the number of divides in the inner loop by taking advantage
of the fixed exponent and mantissa sizes in IEEE double precision numbers. I have not
implemented either of these ideas. Some machines have two types of divide: a fast hardware
divide that may be incorrect in the last couple of bits and a slower but correct software divide.
Demmel, Dhillon and Ren[54] give a proof of correctness for PDSTEBZ, ScaLAPACK's bisection
code for computing the eigenvalues of a tridiagonal matrix, in the face of heterogeneity and
non-monotonic arithmetic (such as sloppy divides). This shows that bisection can be robust
even in the face of incorrect divides.
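
The difference between a comparison against zero and a sign-bit test on IEEE doubles can be seen in a short sketch. It is written in Python with the standard `struct` module purely for illustration and is not the C code used in ScaLAPACK; the function names are hypothetical.

```python
import struct


def sign_bit(x):
    """Return 1 if the IEEE double x has its sign bit set (this includes -0.0), else 0."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits >> 63


def count_negative_pivots(pivots):
    """Count pivots by adding sign bits, with no comparison against zero."""
    return sum(sign_bit(q) for q in pivots)


if __name__ == "__main__":
    # A comparison treats negative zero as non-negative; the sign bit does not.
    print(-0.0 < 0.0, sign_bit(-0.0), sign_bit(0.0))
    print(count_negative_pivots([1.0, -0.0, -2.0, 0.0]))
```

Because `-0.0 < 0.0` is false while the sign bit of `-0.0` is set, the two tests disagree only on negative zero, which is why guaranteeing that negative zero never occurs would make the ordinary comparison safe.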

Many techniques have been used to accelerate eigenvalue extraction, including:
the secant method[33], Laguerre's iteration[138], Rayleigh quotient iteration[163], secular
equation root finding[50] and homotopy continuation[120, 45]. Bassermann and Weidner
use a Newton-like root finder called the Pegasus method[17]. These acceleration techniques
converge super-linearly as long as the eigenvalues are separated.

Li and Ren[121] accelerate eigenvalue separation in their Laguerre based root finder
by detecting linear convergence and estimating the effect of the next several steps. Brent[33]
discusses ways of separating eigenvalues when the secant method is used. Li and Zeng use an
estimate of the multiplicity in their root finder based on Laguerre iteration[122]. Szyld[163]
uses inverse iteration with a shift set to the middle of the interval known to contain only one
eigenvalue to separate eigenvalues before switching to Rayleigh quotient iteration. Cuppen's
method takes advantage of multiple eigenvalues through deflation.

Eigenvalue extraction can be performed in parallel with no communication, or a
small constant amount of communication. However, eigenvalue extraction can exhibit poor
load balance, especially if acceleration techniques are used. Ma and Szyld[128] use a task
queue to improve load balance. Li and Ren[121] minimize load imbalance by concentrating
on worst case performance.

ScaLAPACK  chose  bisection  and  inverse  iteration  for  its  first  tridiagonal  eigensolver, 
PDSYEVX,  because  they  are  fast,  well  known,  robust,  simple  and  parallelize  easily.  ScaLAPACK 
has  since  added  a  QR  based  tridiagonal  eigensolver  for  those  applications  needing  guarantees 
on  orthogonality  within  eigenvectors  corresponding  to  large  clusters  of  eigenvalues.  See 
section  4.3  for  details. 


Divide  and  Conquer 

Cuppen[50]  showed  that  by  making  a  small  perturbation  to  a  tridiagonal  matrix 
it could be split into two separate tridiagonal matrices each of which could be solved
independently, and that the eigendecomposition of the original tridiagonal matrix could then be
constructed  from  the  eigendecomposition  of  the  two  independent  tridiagonal  matrices  and 
the  perturbation. 

There are many ways to perturb a tridiagonal matrix such that the result is
two separate tridiagonal matrices. The following four have been implemented. Cuppen's
algorithm[50] subtracts alpha u u^T from the tridiagonal matrix, where u = e_{n/2} + e_{n/2+1} and
alpha = T_{n/2, n/2+1}. Gu and Eisenstat[89] set all elements in row and column i to zero. Gates
and Arbenz[82] call this a rank-one extension and refer to this as permuting row and column
to the last row and column (as opposed to setting all elements in row and column i to
zero). Gates[80] uses a rank two perturbation: T_{n/2, n/2+1} (e_{n/2} e_{n/2+1}^T + e_{n/2+1} e_{n/2}^T) is subtracted
from the original tridiagonal.
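
The rank-one tear used by Cuppen's algorithm can be checked on a small example. The sketch below stores a tridiagonal matrix as a diagonal list and an off-diagonal list, tears it after row k, and reassembles it; the function names and the dense reassembly used for verification are assumptions made for illustration.

```python
def cuppen_split(d, e, k):
    """Tear the symmetric tridiagonal T (diagonal d, off-diagonal e) after row k.

    Returns (d1, e1, d2, e2, beta) with T = blockdiag(T1, T2) + beta * u u^T,
    where u has ones in 0-based positions k - 1 and k and zeros elsewhere.
    Requires 1 <= k <= len(d) - 1.
    """
    beta = e[k - 1]                      # the coupling element T[k-1][k]
    d1, e1 = list(d[:k]), list(e[:k - 1])
    d2, e2 = list(d[k:]), list(e[k:])
    d1[-1] -= beta                       # compensate for the diagonal of beta * u u^T
    d2[0] -= beta
    return d1, e1, d2, e2, beta


def dense(d, e):
    """Dense form of the tridiagonal matrix with diagonal d and off-diagonal e."""
    n = len(d)
    t = [[0.0] * n for _ in range(n)]
    for i in range(n):
        t[i][i] = d[i]
    for i in range(n - 1):
        t[i][i + 1] = t[i + 1][i] = e[i]
    return t


def reassemble(d1, e1, d2, e2, beta):
    """Form blockdiag(T1, T2) + beta * u u^T as a dense matrix."""
    k = len(d1)
    t = dense(d1 + d2, e1 + [0.0] + e2)
    for i in (k - 1, k):
        for j in (k - 1, k):
            t[i][j] += beta
    return t


if __name__ == "__main__":
    d, e = [4.0, 3.0, 2.0, 1.0], [0.5, 0.25, 0.125]
    print(reassemble(*cuppen_split(d, e, 2)) == dense(d, e))
```

The two blocks T1 and T2 returned by the split are themselves tridiagonal and share no entries, which is what allows them to be solved independently.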

Cuppen's original divide and conquer method can result in a loss of orthogonality
among the eigenvectors. Three methods of maintaining orthogonality have been
implemented. Sorensen and Tang[155] calculate the roots to double precision. Gu and
Eisenstat[89] compute the eigenvectors of a slightly perturbed problem. Gates[81] showed
that inverse iteration and Gram-Schmidt re-orthogonalization could be used in divide and
conquer codes to compute orthogonal eigenvectors.

Several divide and conquer codes are available today. The first publicly available
divide and conquer code, TREEQL, was written by Dongarra and Sorensen[66]. The fastest
reliable serial code currently available for computing the full eigendecomposition of a
tridiagonal matrix is LAPACK's DSTEDC[147]. It is based on Cuppen's divide and conquer[50] and
uses Gu and Eisenstat's[88] method to maintain orthogonality.

There  has  long  been  interest  in  parallelizing  divide  and  conquer  codes  because 
of  the  obvious  parallelism  involved  in  the  early  stages.  There  are  three  reasons  why  this 
technique  has  proven  difficult  to  parallelize.  The  first  is  that  the  majority  of  the  flops  are 
performed at the root of the divide and conquer tree and hence the parallelism at the leaves
is  less  valuable[36].  The  second  is  that  deflation,  the  property  that  makes  DSTEDC  the  fastest 
serial  code,  leads  to  dynamic  load  imbalance  in  parallel  codes.  The  third  is  the  complexity 
of  the  serial  code  itself. 

Dongarra and Sorensen's parallel code[66], SESUPD, was written for a shared memory
machine. The first parallel divide and conquer codes written for distributed memory
computers used a 1D data layout (thus limiting their scalability)[99, 81]. Potter[141] has
written a parallel divide and conquer for small matrices (it requires a full copy of the matrix
on each node). Françoise Tisseur has written a parallel divide and conquer code for
inclusion in ScaLAPACK.

Inverse  Iteration 

Inverse iteration with eigenvalue shifts is typically used to compute the eigenvectors
once the eigenvalues are known[170]. Jessup and Ipsen[102] explain the use of
Gram-Schmidt re-orthogonalization to ensure that the eigenvectors are orthogonal. Fann
and Littlefield[75] found that inverse iteration and Gram-Schmidt can be performed in parallel,
greatly improving its efficiency. Parlett and Dhillon[139, 59] are working on a method,
based on work by Fernando, Parlett and Dhillon[77], that may avoid, or greatly reduce, the
need for re-orthogonalization.

The  Jacobi  method 

The  Jacobi  method  for  the  symmetric  eigenproblem  consists  of  applying  a  series 
of  rotators  each  of  which  forces  a  single  off-diagonal  element  to  zero.  Each  such  rotation 
reduces  the  square  of  the  Frobenius  norm  of  the  off-diagonal  elements  by  the  square  of  the 
element  which  was  eliminated.  Hence,  as  long  as  the  off-diagonal  elements  to  be  eliminated 
are  reasonably  chosen,  the  norm  of  the  off-diagonal  converges  to  zero[167]. 
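The norm reduction described above can be checked numerically. The following is a minimal sketch in plain Python, not code from any of the cited packages; the helper names `jacobi_rotate` and `off_diagonal_sq` and the 3 by 3 test matrix are illustrative assumptions. The off-diagonal norm is summed over the strict lower triangle only, so that eliminating one symmetric pair lowers it by the square of the eliminated element.

```python
import math

def off_diagonal_sq(a):
    """Sum of squares of the strictly lower-triangular entries of a."""
    return sum(a[i][j] ** 2 for i in range(len(a)) for j in range(i))

def jacobi_rotate(a, p, q):
    """Apply one two-sided rotation J^T A J that zeroes a[p][q] and a[q][p].

    a is a symmetric matrix stored as a list of lists and is modified in place.
    """
    if a[p][q] == 0.0:
        return
    # Angle chosen so that the (p, q) entry of J^T A J vanishes.
    theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
    c, s = math.cos(theta), math.sin(theta)
    n = len(a)
    for k in range(n):            # A <- A J: update columns p and q
        akp, akq = a[k][p], a[k][q]
        a[k][p] = c * akp - s * akq
        a[k][q] = s * akp + c * akq
    for k in range(n):            # A <- J^T A: update rows p and q
        apk, aqk = a[p][k], a[q][k]
        a[p][k] = c * apk - s * aqk
        a[q][k] = s * apk + c * aqk

# Hypothetical symmetric test matrix.
a = [[4.0, 1.0, 2.0],
     [1.0, 3.0, 0.5],
     [2.0, 0.5, 1.0]]
eliminated_sq = a[0][2] ** 2
before = off_diagonal_sq(a)
jacobi_rotate(a, 0, 2)
after = off_diagonal_sq(a)
print(before - after, eliminated_sq)  # the two agree up to rounding
```

The other entries in rows and columns p and q are mixed by an orthogonal 2 by 2 transformation, which preserves their combined sum of squares; only the eliminated element contributes to the decrease.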

There are several variations in the Jacobi method. Classical Jacobi[100] selects
the largest off-diagonal element as the element to eliminate at each step, and hence requires
the fewest steps. However, $O(n^2)$ comparisons are required at each step to select the largest
element, requiring $O(n^4)$ comparisons per sweep, rendering it unattractive. Cyclic Jacobi
annihilates every element once per sweep in some specified order. Threshold Jacobi differs
from cyclic Jacobi in that only those elements larger than a given threshold are annihilated.
Block Jacobi annihilates an entire block of elements at each step.

Cyclic, threshold and block variants of Jacobi each have their advantages. Cyclic
Jacobi is the simplest to implement. Block Jacobi requires fewer flops (and if done in
parallel, fewer messages) per element annihilated. Threshold Jacobi requires fewer steps
and converges more surely than cyclic Jacobi; however, a parallel threshold Jacobi requires
more communication. Scott et al. showed that a block threshold Jacobi method[151] is
the best Jacobi method for distributed memory machines; however, it would also be the
most complex to implement. Littlefield and Maschhoff[125] found that for large numbers of
processors, a parallel block Jacobi beat tridiagonal based methods available at that time.

One-sided Jacobi methods apply rotations to only one side of the matrix and force
the columns of the matrix to be orthogonal, so that the columns represent scaled eigenvectors.
One-sided Jacobi methods require fewer flops and may parallelize better[10, 21].

Existing parallel implementations of the Jacobi algorithm are based on a 1D data
layout. Arbenz and Oettli[10] implemented a blocked one-sided Jacobi. Pourzandi and
Tourancheau[142] show that overlapping communication and computation is effective in
a Jacobi implementation on the i860 based NCUBE. Although a 1D data layout is not
scalable, the huge computation to communication ratio in the Jacobi algorithm hides this
on all machines available today.

There are two publicly available parallel Jacobi codes. Fernando wrote a parallel
Jacobi  code  for  NAG[87].  O’Neal  and  Reddy[133]  wrote  a  parallel  Jacobi,  PJAC,  for  the 
Pittsburgh  Supercomputing  Center. 

Demmel and Veselić[58] prove that on scaled diagonally dominant matrices, Jacobi
can compute small eigenvalues with high relative accuracy while tridiagonal based methods
cannot. Demmel et al.[56] give a comprehensive discussion of the situations in which Jacobi
is more accurate than other available algorithms.

The  Jacobi  method  is  discussed  in  Section  7.3. 
2.7.3 Matrix-matrix multiply based methods

There  are  several  methods  for  solving  the  symmetric  eigenproblem  which  can  be 
made  to  use  only  matrix-matrix  multiply. 

Matrix-matrix based methods are attractive because they can be performed efficiently
on all computers, and they scale well. However, they require many more flops
(typically 6 to 60 times more) than reduction to tridiagonal form, tridiagonal eigensolution
and back transformation. Hence, these methods only make sense if tridiagonal based methods
cannot be performed efficiently or do not yield answers that are sufficiently accurate.

Invariant  Subspace  Decomposition  Algorithm 

The Invariant Subspace Decomposition Algorithm[97], ISDA, for solving the symmetric
eigenproblem involves recursively decoupling the matrix $A$ into two smaller matrices.
Each decoupling is achieved by applying an orthogonal similarity transformation, $Q^T A Q$,
such that the first columns of $Q$ span an invariant subspace of $A$. Such a $Q$ is found by
computing a polynomial function of $A$, $A' = p(A)$, which maps all the eigenvalues of $A$
nearly to 0 or 1, and then taking the QR decomposition of $p(A)$. One such polynomial
can be computed by first shifting and scaling $A$ so that all its eigenvalues are known to
be between 0 and 1 (by Gershgorin’s theorem) and then repeatedly computing the beta
function, $A_{i+1} = 3A_i^2 - 2A_i^3$, until all of the eigenvalues of $A_i$ are effectively either 0 or 1.
(All of the eigenvalues of $A_0$ that are less than 0.5 are mapped to 0; all the eigenvalues of
$A_0$ that are greater than 0.5 are mapped to 1.)
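Because a polynomial of a symmetric matrix acts on each eigenvalue independently, the behavior of the beta function can be illustrated on scalars. The sketch below is an illustration only, not the ISDA itself: it iterates the scalar map 3x^2 - 2x^3 on a hypothetical set of eigenvalues already scaled into the interval from 0 to 1. The function names, tolerance, and sample values are assumptions.

```python
def beta_step(x):
    """One application of the polynomial 3x^2 - 2x^3 to a scalar eigenvalue."""
    return 3.0 * x * x - 2.0 * x * x * x

def iterate_beta(eigenvalues, tol=1e-12, max_iter=200):
    """Apply beta_step until every value is within tol of 0 or 1.

    Returns the final values and the number of iterations used.
    """
    vals = list(eigenvalues)
    for iteration in range(max_iter):
        if all(min(abs(v), abs(v - 1.0)) <= tol for v in vals):
            return vals, iteration
        vals = [beta_step(v) for v in vals]
    return vals, max_iter

# Hypothetical eigenvalues, already shifted and scaled into (0, 1).
vals, iterations = iterate_beta([0.05, 0.30, 0.49, 0.51, 0.80, 0.97])
print([round(v) for v in vals], iterations)
```

Values close to 0.5 need the most iterations, since 0.5 is itself a fixed point of the map and nearby values move away from it only gradually at first.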

The ISDA parallelizes well because each of the tasks involved performs well in
parallel[97]. Unfortunately, the ISDA requires far more floating point operations (roughly
$100n^3$) than eigensolvers that are based on reducing the matrix first to tridiagonal form
(which require $8n^3 + O(n^2)$ or fewer flops).

Applying the ISDA for banded matrices greatly reduces the flop count[26]. Furthermore,
the banded matrix multiplications can still be performed efficiently, and the
bandwidth does not triple with each application of $A_{i+1} = 3A_i^2 - 2A_i^3$ as one would expect
with random banded matrices. Nonetheless, the bandwidth does grow enough to necessitate
several band reductions, each of which requires a corresponding back transformation step.

A code based on the ISDA is publicly available from the PRISM group[28].
The ISDA applied directly to the full matrix requires roughly $100n^3$ flops, or 30
times as many as tridiagonal reduction based methods, and hence will never be as fast.
Banded ISDA is almost a tridiagonal based method, but is not likely to be the fastest
method. The quickest way to compute eigenvalues from a banded matrix is to reduce the
matrix first to tridiagonal form. And, if eigenvectors are required, banded ISDA will require
at least twice and probably three times as many flops in back transformation.

FFT  based  invariant  subspace  decomposition 

Yau  and  Lu[174]  implemented  an  FFT  based  invariant  subspace  decomposition 
method.  This  method  requires  O(log(n))  matrix  multiplications.  Tisseur  and  Domas[60] 
have  written  a  parallel  implementation  of  the  Yau  and  Lu  method. 

FFT based invariant subspace decomposition, like ISDA applied to dense matrices,
requires roughly $100n^3$ flops. Hence it, like ISDA, will never be as fast as tridiagonal
reduction based methods.

Strassen’s  matrix  multiply 

Strassen’s matrix-matrix multiply[157] can decrease the execution time for very
large matrix-matrix multiplies by up to 20% but will not make ISDA competitive. Several
implementations of Strassen’s matrix multiply have been able to demonstrate performance
superior to conventional matrix-matrix multiply[96, 43]. However, Strassen’s method is only
useful when performing matrix-matrix multiplies in which all three matrices are very large,
and Strassen’s flop count advantage grows very slowly as the matrix size grows. In order
to double Strassen’s flop count advantage, the matrices being multiplied must be sixteen
times as large and hence memory usage must increase a thousand fold.
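How slowly the flop count advantage grows can be seen from the standard operation count for Strassen's recursion (seven half-size products and eighteen half-size matrix additions per level). The sketch below is a rough illustration only: the cutoff of 64, below which the conventional algorithm is assumed to be used, is a hypothetical choice, the conventional count is taken as about 2n^3 flops, and n is assumed to be the cutoff times a power of two.

```python
def strassen_flops(n, cutoff):
    """Flops for Strassen's recursion on n-by-n matrices, switching to the
    conventional algorithm (about 2*n**3 flops) at or below cutoff."""
    if n <= cutoff:
        return 2 * n ** 3
    half = n // 2
    # Seven half-size products plus eighteen half-size matrix additions.
    return 7 * strassen_flops(half, cutoff) + 18 * half * half

def flop_ratio(n, cutoff=64):
    """Strassen flops divided by conventional flops for the same n."""
    return strassen_flops(n, cutoff) / (2 * n ** 3)

ratios = {n: flop_ratio(n) for n in (64, 128, 256, 512, 1024, 2048)}
for n, r in ratios.items():
    print(n, round(r, 3))
```

Each doubling of the matrix dimension multiplies the ratio by a little more than 7/8, while the storage for each matrix quadruples. Flop counts alone do not predict execution time, so this says nothing about the measured speedups reported in the cited implementations.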

2.7.4  Orthogonality 

Some methods, notably inverse iteration, require extra care to ensure that the
eigenvectors are orthogonal. In exact arithmetic, if two eigenvalues differ, their corresponding
eigenvectors will be orthogonal. However, if the input matrix has, say, a double eigenvalue,
the eigenvectors corresponding to this double eigenvalue span a two-dimensional
subspace and hence there is no guarantee that two eigenvectors chosen at random from
this space will be orthogonal. In floating point arithmetic, inverse iteration without
re-orthogonalization may not produce orthogonal eigenvectors when two or more eigenvalues
are nearly identical. In DSTEIN, LAPACK’s inverse iteration code, when computing the
eigenvectors for a cluster of eigenvalues, modified Gram-Schmidt re-orthogonalization is employed
after each iteration to re-orthogonalize the iterate against the eigenvectors of the other
eigenvalues in the cluster[102]. Modified Gram-Schmidt re-orthogonalization parallelizes poorly
because it is a series of dot products and DAXPYs, each of which depends upon the result of the
immediately preceding operation. PeIGs[74] and PDSYEVX[68] have chosen different responses to the fact
that the re-orthogonalization in DSYEVX parallelizes poorly.
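The dependency chain that limits parallelism in modified Gram-Schmidt can be seen in a minimal sketch. This is plain Python for illustration, not the DSTEIN code; the function name is an assumption, the basis vectors are assumed to be orthonormal already, and the input is assumed not to lie in their span.

```python
import math

def mgs_orthogonalize(v, basis):
    """Orthogonalize v against the orthonormal vectors in basis using
    modified Gram-Schmidt, then normalize the result."""
    w = list(v)
    for q in basis:
        # The dot product uses the current w, so it cannot start until the
        # update from the previous basis vector has finished.
        r = sum(qi * wi for qi, wi in zip(q, w))
        # AXPY-style update: w <- w - r * q
        w = [wi - r * qi for wi, qi in zip(w, q)]
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]

# Hypothetical example: orthogonalize against two unit vectors.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w = mgs_orthogonalize([1.0, 2.0, 3.0], basis)
print(w)
```

Each pass through the loop is one dot product followed by one update of the same vector, which is the serial sequence of operations described above.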

PeIGs alternates inverse iteration and re-orthogonalization in a different manner
than DSYEVX. Instead of computing one eigenvector at a time, all of the eigenvectors within
a cluster are computed simultaneously. For each cluster, PeIGs first performs a round of
inverse iteration without re-orthogonalization using random starting vectors. Then, PeIGs
performs modified Gram-Schmidt re-orthogonalization twice to orthogonalize the eigenvectors.
PeIGs performs a second round of inverse iteration without re-orthogonalization, using
the output from the previous step as the starting vectors, and again repeating until sufficient
accuracy is obtained for each eigenvector. Finally, PeIGs performs modified Gram-Schmidt
re-orthogonalization one last time. They have shown that this method works on application
matrices with large clusters of eigenvalues.

PDSYEVX attempts to assign the computation of all eigenvectors associated with
each cluster of eigenvalues to a single processor. When enough space is available to accomplish
this, PDSYEVX produces exactly the same results as DSYEVX. When the user does not
provide enough local workspace, PDSYEVX relaxes the definition of cluster repeatedly until it
can assign all the computation of all eigenvectors associated with each cluster of eigenvalues
to a single processor.

When  the  input  matrix  contains  one  or  more  very  large  clusters  of  eigenvalues, 
PDSYEVX  performs  poorly:  If  enough  workspace  is  available,  PDSYEVX  gives  the  same  results 
as  DSYEVX,  but  runs  very  slowly.  If  insufficient  workspace  is  available,  PDSYEVX  does  not 
guarantee orthogonality. Dhillon explains the fundamental problems in inverse iteration[59].

Recently Parlett and Dhillon have identified new techniques for computing the
eigenvectors of a symmetric tridiagonal matrix[136, 139]. These new results raise the hope
that we will soon have an $O(n^2)$ method for computing the eigenvectors of a symmetric
tridiagonal matrix which parallelizes well and avoids the problems with computing the
eigenvectors associated with clustered eigenvalues. ScaLAPACK looks forward to applying
these new techniques in a future release.
Chapter 3

Basic  Linear  Algebra  Subroutines 

3.1  BLAS  design  and  implementation 

The BLAS[117, 63, 62], Basic Linear Algebra Subroutines, were designed to allow
portable codes, most of whose operations are matrix-matrix multiplications, matrix-vector
multiplications, and related linear algebra operations, to achieve high performance provided
that the BLAS achieve high performance. In LAPACK[4], the BLAS were used to re-express the
linear algebra algorithms in the previous libraries Linpack[61] and EISPACK[153], thereby
achieving performance portability.

The BLAS routines are split into three sets. BLAS Level 1 routines involve only
vectors, require $O(n)$ flops (on input vectors of length $n$) and two or three memory
operations for every two flops performed. BLAS Level 2 routines involve one $n$ by $n$ matrix,
$O(n^2)$ flops and one or two memory operations for every two flops performed (rectangular
matrices are also supported). BLAS Level 3 routines involve only matrices, $O(n^3)$ flops
and $O(n^2)$ memory operations. BLAS Level 1 routines, because they involve only $O(n)$ operations
per invocation, have the least flexibility in how the operations are ordered, and require
the most memory operations per flop. Hence, BLAS Level 1 routines have the lowest peak
floating point operation rate. They also have the lowest software overhead, an important
consideration because they perform few operations. BLAS Level 3 routines have the most
flexibility in how the operations are ordered and require the fewest memory operations per
flop, and hence achieve the highest performance on large tasks. BLAS Level 1 and 2 routines
are typically limited by the speed of memory. BLAS Level 3 routines typically execute very
near the peak speed of the floating point unit.
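The difference in memory traffic per flop between the three levels can be made concrete with idealized counts for one representative routine from each level. The sketch below assumes square operands of dimension n and counts each array element as read or written exactly once per call; real implementations and cache behavior will differ, and the function names are illustrative.

```python
def blas_counts(n):
    """Idealized (flops, memory_operations) for one routine per BLAS level."""
    return {
        # y <- a*x + y: read x, read y, write y
        "DAXPY (Level 1)": (2 * n, 3 * n),
        # y <- a*A*x + b*y: read A once, read x, read y, write y
        "DGEMV (Level 2)": (2 * n * n, n * n + 3 * n),
        # C <- a*A*B + b*C: read A, B and C, write C
        "DGEMM (Level 3)": (2 * n ** 3, 4 * n * n),
    }

def memory_ops_per_flop(n):
    """Memory operations divided by flops for each routine."""
    return {name: mem / flops for name, (flops, mem) in blas_counts(n).items()}

for name, ratio in memory_ops_per_flop(1000).items():
    print(name, ratio)
```

For Level 1 and Level 2 the ratio stays roughly constant as n grows, while for Level 3 it falls in proportion to 1/n, which is why only Level 3 routines can approach the peak speed of the floating point unit.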




Typical  hardware  architectures  make  it  possible,  but  not  easy,  to  achieve  high 
floating  point  execution  rates  for  matrix-matrix  multiply.  Floating  point  units  can  initiate 
floating  point  operations  every  2  to  5  nanoseconds  though  floating  point  operations  take  10 
to  30  nanoseconds  to  complete  and  main  memory  requires  20  to  60  nanoseconds  per  random 
data  fetch.  Floating  point  units  achieve  high  throughput  through  concurrency,  allowing 
multiple  operations  to  be  performed  simultaneously,  and  pipelining,  starting  operations 
before  the  previous  operation  is  complete.  Register  files  are  made  large  enough  to  provide 
source  and  target  registers  for  as  many  operations  as  can  be  active  at  one  time.  Main 
memory  throughput  can  be  enhanced  by  interleaving  memory  banks  and  by  fetching  several 
words  simultaneously  (or  nearly  so)  from  main  memory.  Memory  performance  is  further 
enhanced  by  the  use  of  caches.  Two  levels  of  caches  are  now  typical  and  systems  are  now 
being  designed  with  three  levels  of  caches. 

High performance BLAS routines typically incur significant software overhead because,
to achieve near the floating point unit’s peak performance, they need an
inner loop that can keep the floating point units busy, surrounded by one or more levels
of blocking to keep the memory accesses in the fastest memory possible. Managing
concurrency and/or pipelining requires a long inner loop which operates on several vectors at
once. Each level of blocking requires additional control code and separate loops to handle
portions of the matrix that are not exact multiples of the block size. For example, DGEMV¹
(double precision matrix-vector multiplication) on the PARAGON has an average software
overhead of 23 microseconds (over 1000 cycles at 50 MHz) and includes 200 instructions of
error checking and case selection, 750 instructions for the transpose case and 500 for the
non-transpose case².

3.2  BLAS  execution  time 

The  execution  time  for  each  call  to  a  BLAS  routine  depends  upon  the  hardware, 
the  BLAS  implementation,  the  operation  requested  and  the  state  of  the  machine,  especially 
the  contents  of  the  caches,  at  the  time  of  the  call.  The  time  per  DGEMV,  or  BLAS  Level  2, 
flop  is  limited  by  the  speed  of  the  memory  hierarchy  level  at  which  the  matrix  resides.  The 

¹ DGEMV performs $y = \alpha A x + \beta y$ or $y = \alpha A^T x + \beta y$, where $A$ is a matrix, $x$ and $y$ are
vectors and $\alpha$ and $\beta$ are scalars.

² These instruction counts include all instructions routinely executed during the main loop in reduction
to tridiagonal form. Not all are executed during each call to DGEMV.
Table 3.1: BLAS execution time (Time = $\delta_i$ + number of flops $\cdot$ $\gamma_i$, in microseconds)
time per DGEMM³, or BLAS Level 3, flop is typically limited primarily by the rate at which
the floating point unit can issue and complete instructions. We will concentrate on DGEMM
and DGEMV because they perform most of the flops in PDSYEVX.

Table  3.1  shows  the  software  overhead  and  time  per  flop  for  the  BLAS  routines. 
These  times  are  based  on  independent  timings  with  code  cached  but  not  data  cached  using 
invocations  that  are  typical  for  PDSYEVX.  Recall  that  these  parameters  are  used  in  a  linear 
model  of  performance: 


$\delta + \text{number of flops} \times \gamma$    (Line 1)
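The linear model above, a fixed software overhead plus a time per flop, can be written as a one-line function. The parameter values in this sketch are hypothetical and chosen only to show the shape of the model; they are not the measured values of Table 3.1.

```python
def blas_time(flops, overhead_us, time_per_flop_us):
    """Predicted time in microseconds: software overhead plus
    number of flops times time per flop."""
    return overhead_us + flops * time_per_flop_us

# Hypothetical parameters for illustration only.
overhead_us = 20.0
time_per_flop_us = 0.02

small = blas_time(10, overhead_us, time_per_flop_us)
large = blas_time(1_000_000, overhead_us, time_per_flop_us)

# Fraction of the predicted time that is software overhead.
print(overhead_us / small, overhead_us / large)
```

With these illustrative numbers the overhead accounts for nearly all of the predicted time of the small call and a negligible share of the large one, which is why a model without the overhead term would badly underestimate calls that perform few flops.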

In PDSYEVX we are most concerned with the time per flop for Level 3 routines and
secondarily concerned with the time per flop and software overhead for Level 2 routines.
For n=3840 and p=64, on the Paragon, the three largest components attributable to the
items in Table 3.1 are: 28% of the PDSYEVX execution time is attributable to BLAS Level

³ DGEMM performs $C = \alpha A B + \beta C$ or $C = \alpha A^T B + \beta C$, where $A$, $B$ and $C$ are matrices, and $\alpha$ and $\beta$ are
scalars.
3 floating point execution (not including software overhead), 8% is attributable to BLAS
Level 2 floating point execution and 5% is attributable to BLAS Level 2 software overhead.
(See Chapter 5 for details.) The fact that the BLAS3 software overhead for the IBM SP2 is
listed as 0 stems from the fact that matrix-matrix multiply is faster for small problem sizes
because they fit in cache⁴.


Figure 3.1: Performance of DGEMV on the Intel PARAGON
[Plot title: DGEMV expected/actual execution time on XPS5]


Figure 3.1 shows how actual DGEMV performance differs from predicted performance
(Line 1) on the PARAGON. Each point represents the time required for a call to DGEMV with
parameters that are typical of calls to DGEMV made in PDSYEVX divided by the time predicted
by our performance model. The timings are made by an independent timer as described in
Section 3.3. The model matches quite well on most calls to DGEMV. It also shows a modest,
but noticeable difference between the cost when data is cached versus when it is not. If

⁴ We did not pursue this because BLAS3 software overhead has little impact on PDSYEVX execution time.
the software overhead term were removed (i.e., using number of flops $\times \gamma_2$ as the model) the
model would underestimate execution time by a factor of two hundred or more on small problem
sizes.

Some calls to DGEMV require much less time than expected, as little as 1/9 of the
predicted time, indicating that the software overhead is not independent of the type of call
made. In particular, calls which involve very few flops can vary widely in their execution
time relative to the predicted time. However, not many calls differ widely in their execution
time, and those that do require few flops (hence little execution time). The fact that they
do not match well does not significantly affect the accuracy of my performance model for
PDSYEVX (given in Chapter 4), and hence I did not study them further.

Figure 3.2 shows that DGEMV on the PARAGON requires 10 to 50 microseconds longer if
the code is not cached at the time it is called. The additional time required is estimated
by subtracting the cost of running DGEMV alone from the cost of running DGEMV followed
by 16,384 no-ops⁵ while accounting for the execution time of the 16,384 no-ops themselves.
The extra time required increases as the number of flops increases. And the extra time is
greater when the data is not cached than when it is cached⁶. It is not surprising that the
extra time required when the code is not cached increases as the number of flops increases
because when few flops are involved, the code does not execute as many loops. However,
it is surprising that the code cache miss cost in the “Data not cached” case appears to
increase almost linearly with the number of flops; I would expect to see something closer
to a step function. This deserves further study if it is determined that code cache misses
substantially affect execution time.

Figure 3.3 shows that the extra time required by DGEMV ranges from 1.5% (when
DGEMV performs many flops) to over 10% (when DGEMV performs few flops). Only calls made
to DGEMV with parameters that are typical of the calls commonly made by PDSYEVX are
shown. The extra time required when code is not cached can be up to 80% on calls made
to DGEMV requiring very few flops, but these are rare in PDSYEVX.

⁵ The code cache holds 8,192 no-ops. Hence, 16,384 guarantees that the no-ops are not in cache, making
their execution time independent of what is in the code cache at the time the 16,384 no-ops are executed.

⁶ I compare the execution time when neither code nor data is cached to the execution time when code is
cached but data is not when estimating the extra time required when data is not cached.
Figure 3.2: Additional execution time required for DGEMV when the code cache is flushed
between each call. The y-axis shows the difference between the time required for a run
which consists of one loop executing 16,384 no-ops after each call to DGEMV and the time
required for a run which includes two loops, one executing DGEMV and one executing 16,384
no-ops.

[Plot title: DGEMV code cache miss cost on XPS5; y-axis scaled by 10^-5]


3.3  Timing  methodology 

Each  routine  is  timed  with  several  sets  of  input  parameters.  To  time  a  routine 
with  a  given  set  of  input  parameters,  the  routine  is  run  three  times  and  the  time  from  the 
third  run  is  used.  Each  run  consists  of  calling  the  routine  to  be  timed  repeatedly  within  a 
loop.  The  first  run,  in  which  the  loop  is  run  only  once,  ensures  that  the  code  is  paged  in. 
The  second  run,  in  which  the  loop  is  run  just  long  enough  to  exceed  the  timer  resolution, 
provides  an  estimate  that  is  used  to  determine  how  many  times  to  run  the  third  run.  The 
third  run,  in  which  the  loop  is  run  for  approximately  one  second,  is  the  only  one  whose 
execution time is recorded. We record both CPU time and wall clock time. These plots are
based on CPU time.

Figure 3.3: Additional execution time required for DGEMV when the code cache is flushed
between each call as a percentage of the time required when the code is cached. See
Figure 3.2.

[Plot title: DGEMV code cache miss cost on XPS5]

The input parameters for each run are randomly selected such that they match
the input parameters used in a typical call to DGEMV from PDSYEVX. Randomly selecting
the input parameters provides advantages over a systematic choice of input parameters.
A systematic choice of input parameters might include, for example, only even values of k,
whereas odd values of k might require significantly longer. Random selection means that the
likelihood of identifying anomalous behavior is directly related to how often that behavior
occurs in calls within PDSYEVX. Random selection scales well also: it is easy to increase or
decrease the number of timings and/or the number of processors used.
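The three-run procedure of this section, combined with randomly chosen input parameters, can be sketched as follows. This is a minimal illustration in Python rather than the actual timing harness: the function name `time_routine`, the doubling strategy in the second run, the `min_measurable` threshold standing in for the timer resolution, and the parameter ranges are all assumptions, and a plain Python matrix-vector product stands in for DGEMV.

```python
import random
import time

def time_routine(routine, target_seconds=1.0, min_measurable=0.01):
    """Return (cpu_seconds_per_call, wall_seconds_per_call) from the third
    of three runs."""
    # Run 1: a single call, so the code is loaded before anything is measured.
    routine()

    # Run 2: lengthen the loop until it is long enough to measure, then use
    # it to estimate the cost of one call.
    reps = 1
    while True:
        start = time.perf_counter()
        for _ in range(reps):
            routine()
        elapsed = time.perf_counter() - start
        if elapsed >= min_measurable:
            break
        reps *= 2
    per_call_estimate = elapsed / reps

    # Run 3: loop for roughly target_seconds; only this run is reported.
    reps = max(1, int(target_seconds / per_call_estimate))
    cpu_start, wall_start = time.process_time(), time.perf_counter()
    for _ in range(reps):
        routine()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / reps, wall / reps

# Randomly selected (hypothetical) problem dimensions.
rng = random.Random(0)
m, n = rng.randint(8, 64), rng.randint(8, 64)
a = [[rng.random() for _ in range(n)] for _ in range(m)]
x = [rng.random() for _ in range(n)]

def matvec():
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

cpu_per_call, wall_per_call = time_routine(matvec, target_seconds=0.05)
print(m, n, cpu_per_call, wall_per_call)
```

The usage example shortens the third run to 0.05 seconds so that it finishes quickly; the text above uses approximately one second.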
3.4 The cost of code and data cache misses in DGEMV

Each  set  of  input  parameters  is  timed  under  four  different  cache  situations: 

Code  and  data  cached 
Code  cached  but  data  not  cached 
Code  not  cached  but  data  cached 
Neither  code  nor  data  cached 

Data  can  be  allowed  to  remain  in  cache  (to  the  extent  that  it  fits  in  cache)  by 
using  the  same  arrays  in  each  call  within  the  timing  loop.  Likewise  data  can  be  prevented 
from  remaining  in  cache  by  using  different  arrays  for  each  call  within  the  timing  loop. 
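The array selection just described can be sketched as a small helper that decides which operands each call in the timing loop receives. The name `pick_operands` and the pool of four small arrays are illustrative assumptions. Whether rotating through the pool actually evicts data depends on the pool being larger than the cache, which this sketch does not check.

```python
def pick_operands(call_index, pool, keep_data_cached):
    """Choose which pre-allocated operand set a timing-loop call should use."""
    if keep_data_cached:
        return pool[0]                    # same arrays on every call
    return pool[call_index % len(pool)]   # a different array set on each call

# Four small stand-in arrays; a real pool would need to exceed the cache size.
pool = [[float(i)] * 8 for i in range(4)]
cached_sequence = [pick_operands(i, pool, True)[0] for i in range(6)]
uncached_sequence = [pick_operands(i, pool, False)[0] for i in range(6)]
print(cached_sequence, uncached_sequence)
```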

Allowing  data  to  reside  in  cache  reduces  execution  time  in  two  ways.  It  reduces 
the  cost  of  accessing  the  data  in  the  arrays  being  operated  on  and  it  reduces  the  software 
overhead  cost,  because  software  overhead  also  involves  reading  and  writing  data,  notably 
while  saving  and  restoring  registers. 

Code and data cache misses are more important in DGEMV than in DGEMM because
DGEMV is called more often than DGEMM and because the ratio of flops to data movement is
higher for DGEMM than for DGEMV, which reduces the relative cost of data cache misses in DGEMM.

3.5  Miscellaneous  timing  details 

We make sure that timings are not affected by conditions which are not likely to
be encountered in a typical run of PDSYEVX. Exceptional numbers (subnormalized numbers
and infinities) will occur only rarely in PDSYEVX⁷. Hence, we make sure that exceptional
numbers do not appear during our timing runs.

We do not time PDSYEVX on problem sizes that do not fit in physical memory.
Hence, when timing the individual BLAS routines, we make sure that the arrays fit in physical
memory. Ed D’Azevedo has written an out-of-core symmetric eigensolver and studied the
effect of paging on PDSYEVX[52].

⁷ The matrix is scaled before reduction to tridiagonal form to avoid being close to the overflow or underflow
threshold. Although this does not prevent underflows (or subnormalized numbers) it causes them to be rare.
NaNs will never appear in PDSYEVX unless NaNs appear in the input.
We measure and report both wall clock time and CPU time. Wall clock time may
differ from CPU time for several reasons, including: time spent waiting for communication,
time spent on other processes and time spent on paging and other operating system services.
When timing the BLAS, we are primarily interested in CPU time because there is no
communication and we are not interested in measuring the time spent waiting on other
processes. However, we measure and report wall clock time because for all other timings
we must rely on wall clock timings⁸. When the wall clock time differs substantially from
the CPU time on calls to the BLAS on time shared systems (such as the IBM SP2) we use
the ratio of wall clock time to CPU time as a crude measure of the load on the system.

We use the timing routines included in the BLACS routines developed at the University
of Tennessee at Knoxville[169, 69] (which are not a part of the BLACS specification).
Many modern computers have cycle time counters which would allow much more detailed
measurement of execution time and often other machine characteristics. These detailed
timing routines are not portable and I chose to stick to portable timing techniques.
Alternatively, Krste Asanovic has developed a portable interface for taking performance related
statistics over an "interval" of a code’s execution[11].


⁸ CPU time is often meaningless when communication is involved.
Chapter 4

Details  of  the  execution  time  of 
PDSYEVX 

4.1  High  level  overview  of  PDSYEVX  algorithm 

Figure 4.1 shows how PDSYEVX reduces the original (dense) matrix to tridiagonal
form (Line 1), uses bisection and inverse iteration to solve the tridiagonal eigenproblem
(Line 2) and then transforms the eigenvectors of the tridiagonal matrix back into the
eigenvectors of the original dense matrix (Line 3). PDSYEVX uses a two-dimensional block cyclic
data layout with an algorithmic block size equal to the data layout block size in both
Householder reduction to tridiagonal form and back transformation. When using bisection
to compute the eigenvalues, it assigns each process an essentially equal number of
eigenvalues to compute. For inverse iteration, PDSYEVX attempts to assign roughly equal
numbers of eigenvectors to each process while assigning all eigenvectors corresponding to
a given cluster of eigenvalues to the same process. Gram-Schmidt re-orthogonalization is
performed locally within each process and hence orthogonality is not guaranteed for
eigenvectors corresponding to eigenvalues within a cluster that is too large to fit on a single
process.
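The three steps fit together as a single chain of identities. The following restates them using the notation of Figure 4.1, where $Q$ is the orthogonal matrix from the reduction, $T$ is the tridiagonal matrix, and $U$ and $\Lambda$ hold the eigenvectors and eigenvalues of $T$; it assumes exact arithmetic and adds nothing beyond what the figure defines.

```latex
A = Q T Q^T
  = Q \,(U \Lambda U^T)\, Q^T
  = (QU)\,\Lambda\,(QU)^T
  = V \Lambda V^T,
\qquad V = QU,
\qquad V^T V = U^T Q^T Q\, U = I .
```

Column by column this reads $A v_i = \lambda_i v_i$, so back transformation is the matrix product $V = QU$.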

We  assume  that  only  the  lower  triangle  of  the  square  symmetric  matrix  A  contains 
valid  data  on  input  and  the  algorithms  only  read  and  write  this  lower  triangle.  The  general 
conclusions  of  this  thesis  apply  to  the  upper  triangular  case  as  well. 

Please refer to Table A.1, Table A.2, and Table A in Appendix A for the list of
notation used in this chapter.

Figure 4.1: PDSYEVX algorithm

$A \in \mathbb{R}^{n \times n}$ is the matrix whose eigendecomposition is sought.

(Line 1)  $A = Q T Q^T$, where $T$ is tridiagonal and $Q$ is orthogonal.

(Line 2)  $T = U \Lambda U^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is the diagonal matrix of
eigenvalues and the columns of $U = [u_1 \ldots u_n]$ are the eigenvectors of $T$: $T u_i = \lambda_i u_i$.

(Line 3)  $V = Q U$, where the columns of $V = [v_1 \ldots v_n]$ are the eigenvectors
of $A$: $A v_i = \lambda_i v_i$.

Section 4.2 describes and models reduction to tridiagonal form as performed by
PDSYTRD. Section 4.3 describes and models the tridiagonal eigensolution as performed by
PDSTEBZ (bisection) and PDSTEIN (inverse iteration). Section 4.4 describes and models back
transformation as performed by PDORMTR.

4.2  Reduction  to  tridiagonal  form 

4.2.1  Householder’s  algorithm 

Figure 4.4 shows Householder's reduction to tridiagonal form, and Figure 4.5 shows a
model for the runtime of ScaLAPACK's reduction to tridiagonal form code, PDSYTRD. The
rest of this section explains the computation and communication pattern in PDSYTRD. We
begin by describing the classical (serial and unblocked) algorithm (essentially the EISPACK
algorithm TRED1 and also LAPACK's DSYTD2), then the blocked (but still serial) algorithm
(essentially the LAPACK algorithm DSYTRD), and finally the parallel blocked ScaLAPACK
algorithm PDSYTRD.

Classical (serial and unblocked) Householder reduction (Figure 4.2)

Figure 4.2 shows the algorithm for the classical (serial and unblocked) Householder
reduction to tridiagonal form (essentially the algorithm used in LAPACK's DSYTD2).

The first iteration through the loop performs an orthogonal similarity transformation
of the form $A \leftarrow (I - \tau v v^T) A (I - \tau v v^T)$, where $\tau = 2 / \lVert v \rVert_2^2$, such that only the first
two elements in the first column (and hence the first two elements in the first row) of $A$
are non-zero. Each iteration through the loop repeats these steps on the trailing submatrix
$A(2:n, 2:n)$ to reduce $A$ to tridiagonal form by a series of similarity transformations.

Compute an appropriate reflector (Line 2.1 in Figure 4.2)

We seek a reflector of the form $I - \tau v v^T$ such that $\tau = 2 / (v^T v)$ and the first row and
column of $(I - \tau v v^T) A (I - \tau v v^T)$ have zeroes in all entries except the first two.

Let $z$ be the column vector $A(2:n, 1)$. In exact arithmetic, any vector
$v = c\,[z_1 \pm \lVert z \rVert_2, z_2, \ldots, z_n]$ for any scalar $c$ will suffice, and defines what value $\tau$ must take. LAPACK
and ScaLAPACK choose the sign ($\pm \lVert z \rVert_2$) to match the sign of $z_1$ to minimize roundoff
errors, and choose $c$ such that $v(1) = 1.0$. $c$ can also be chosen to be 1, avoiding the
need to multiply $z$ by $c$, at some small risk of over/underflow.
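
As an illustration of the choices just described, the following sketch builds a reflector with the sign matched to the first element and the scaling $v(1) = 1$. The function names `house` and `apply_reflector` are chosen for this example and do not reproduce the LAPACK routine; no protection against overflow or underflow is attempted.

```python
import math

def house(z):
    """Return (tau, v) with v[0] == 1 such that (I - tau*v*v^T) z = [beta, 0, ..., 0]."""
    norm = math.sqrt(sum(x * x for x in z))
    if norm == 0.0:
        return 0.0, [1.0] + [0.0] * (len(z) - 1)
    sign = 1.0 if z[0] >= 0.0 else -1.0
    pivot = z[0] + sign * norm               # same sign as z[0], so no cancellation
    v = [1.0] + [x / pivot for x in z[1:]]   # c = 1/pivot gives v[0] = 1
    tau = 2.0 / sum(x * x for x in v)
    return tau, v

def apply_reflector(tau, v, z):
    """Compute (I - tau*v*v^T) z without forming the matrix."""
    s = tau * sum(a * b for a, b in zip(v, z))
    return [zi - s * vi for zi, vi in zip(z, v)]
```

For `z = [3.0, 4.0]` the reflected vector is `[-5.0, 0.0]`: the norm is preserved and everything below the first element is annihilated.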

Form the matrix vector product $y = Av$ (Line 3.3 in Figure 4.2)

This is a matrix vector multiply (Basic Linear Algebra Subroutines Level 2) requiring
$2(n - i)^2$ flops, which when summed from $i = 1$ to $n - 1$ totals $\frac{2}{3} n^3$ flops.
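
The flop total can be checked numerically. The short sketch below (the function name is illustrative) sums the per-step count and compares it with the leading-order term $\frac{2}{3} n^3$.

```python
def matvec_flops(n):
    """Sum of 2*(n - i)^2 over i = 1 .. n-1."""
    return sum(2 * (n - i) ** 2 for i in range(1, n))

n = 1000
ratio = matvec_flops(n) / (2.0 * n ** 3 / 3.0)
```

For `n = 1000` the ratio is within one percent of 1; the difference comes from the lower-order terms that the estimate drops.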

Compute the companion update vector $w = y - \tau (y^T \cdot v) v$ (Line 5.1 in Figure 4.2)

The vector $w$ (which is computed here with a dot product and a DAXPY) has the
property that $(I - \tau v v^T) A (I - \tau v v^T) = A - v w^T - w v^T$.
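
The property can be verified on a small example. The sketch below assumes the scaling $w = \tau y - \frac{1}{2} \tau^2 (y^T v) v$ with $y = Av$, which is the scaling used in LAPACK's DSYTD2 and which may differ from the notation of this section by how $\tau$ is folded into $y$. All function names are illustrative.

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def companion_vector(A, v, tau):
    """w = tau*y - (tau^2 / 2) * (y . v) * v with y = A v (assumed scaling)."""
    y = matvec(A, v)
    c = sum(a * b for a, b in zip(y, v))
    return [tau * yi - 0.5 * tau * tau * c * vi for yi, vi in zip(y, v)]

def two_sided(A, v, tau):
    """Form (I - tau v v^T) A (I - tau v v^T) explicitly, for checking only."""
    n = len(v)
    H = [[(1.0 if i == j else 0.0) - tau * v[i] * v[j] for j in range(n)]
         for i in range(n)]
    HA = [[sum(H[i][k] * A[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[sum(HA[i][k] * H[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rank2_update(A, v, w):
    """A - v w^T - w v^T."""
    n = len(v)
    return [[A[i][j] - v[i] * w[j] - w[i] * v[j] for j in range(n)]
            for i in range(n)]
```

For a symmetric `A`, `two_sided(A, v, tau)` and `rank2_update(A, v, companion_vector(A, v, tau))` agree to rounding error, while the second form needs only one matrix vector product and one rank-2 update.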

Update the matrix (Line 6.3 in Figure 4.2)

Compute $A = A - v w^T - w v^T$, a BLAS Level 2 rank-2 update. A rank-2 update requires
4 flops per element updated. Because only the lower triangular portion of $A$ is updated, this
requires $2(n - i)^2$ flops, which summed over $i = 1$ to $n - 1$ is $\frac{2}{3} n^3$ flops.

Figure 4.2: Classical unblocked, serial reduction to tridiagonal form, i.e. EISPACK's
TRED1. (The line numbers are consistent with Figures 4.3, 4.4 and 4.5.)

do i = 1, n

      Compute reflector
2.1     [τ, v] = house(A(i+1:n, i))

      Perform matrix-vector multiply
3.3     w = tril(A(i+1:n, i+1:n)) v
            + tril(A(i+1:n, i+1:n), -1)^T v

      Compute companion update vector
5.1     c = w · v^T;
        w = τ w - (c τ/2) v

      Perform rank 2 update
6.3     A(i+1:n, i+1:n) = tril(A(i+1:n, i+1:n) - w v^T - v w^T)

end do i = 1, n

Notes to Figure 4.2:

Line 2.1: v ∈ R^(n-i); τ is a scalar. House computes a Householder vector such that
(I - τ v v^T) A(i+1:n, i) (I - τ v v^T) is zero except for the top element.

Line 3.3: w ∈ R^(n-i). tril() is MATLAB notation for the lower triangular portion of
a matrix (including the diagonal). tril( , -1) refers to the portion of the matrix below
the diagonal.

Line 6.3: Here we use tril to indicate that only the lower triangular portion of A need
be updated.
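
The loop of Figure 4.2 can be written out as a small serial routine. This is a sketch under stated assumptions, not the EISPACK or LAPACK code: it uses full (not lower-triangular) storage for clarity, 0-based indices, and the companion-vector scaling $w = \tau y - \frac{1}{2}\tau^2 (y^T v) v$ of LAPACK's DSYTD2. The name `tridiagonalize` is illustrative.

```python
import math

def tridiagonalize(A):
    """Reduce a symmetric matrix (list of lists) to tridiagonal form in place."""
    n = len(A)
    for i in range(n - 2):
        z = [A[k][i] for k in range(i + 1, n)]
        norm = math.sqrt(sum(x * x for x in z))
        if norm == 0.0:
            continue                              # column is already reduced
        # Line 2.1: reflector with v[0] = 1 and sign matched to z[0]
        sign = 1.0 if z[0] >= 0.0 else -1.0
        pivot = z[0] + sign * norm
        v = [1.0] + [x / pivot for x in z[1:]]
        tau = 2.0 / sum(x * x for x in v)
        m = n - i - 1
        # Line 3.3: y = A(i+1:n, i+1:n) v
        y = [sum(A[i + 1 + r][i + 1 + col] * v[col] for col in range(m))
             for r in range(m)]
        # Line 5.1: companion update vector
        dot = sum(a * b for a, b in zip(y, v))
        w = [tau * yr - 0.5 * tau * tau * dot * vr for yr, vr in zip(y, v)]
        # Line 6.3: rank-2 update of the trailing matrix
        for r in range(m):
            for s in range(m):
                A[i + 1 + r][i + 1 + s] -= v[r] * w[s] + w[r] * v[s]
        # Column i (and row i) now hold [beta, 0, ..., 0] below the diagonal
        beta = -sign * norm
        A[i + 1][i] = A[i][i + 1] = beta
        for k in range(i + 2, n):
            A[k][i] = A[i][k] = 0.0
    return A
```

Because every step is an orthogonal similarity transformation, the trace and the Frobenius norm of the matrix are unchanged, which gives a cheap sanity check on the result.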


Blocked Householder reduction to tridiagonal form (Figure 4.3)

In the above algorithm, nearly all the flops are performed in the product $y = Av$
or the rank-2 update $A - v w^T - w v^T$, both of which are BLAS Level 2 operations. Through
blocking, half of the flops can be executed as BLAS 3 flops because $k$ matrix updates
can be performed as one rank-$2k$ update instead of $k$ rank-2 updates. This is done in
Line 6.3 in Figure 4.3. The cost of blocking is significant in PDSYTRD, but so is the gain
(see section 7.2.2). Blocking allows the matrix update to be considerably more efficient, but it
complicates the computation of the reflector and the computation of the companion update
vector, because PDSYTRD must work with an out-of-date matrix. Starting with $A_0$, the
computation of the first reflector $v_0$, the matrix vector product and $w_0$ are unchanged, but
as soon as PDSYTRD attempts to compute the second reflector, $v_1$, it has to deal with the fact
that $A_1$ is known only in factored form, i.e. $A_1 = A_0 - v_0 w_0^T - w_0 v_0^T$. This does not greatly
complicate computing the reflector because the reflector needs only the first column of $A_1$.

Figure 4.3: Blocked, serial reduction to tridiagonal form, i.e. DSYEVX (See Figure 4.2 for
unblocked serial code)

do ii = 1, n, nb
  mxi = min(ii + nb, n)
  do i = ii, mxi

      Update current (ith) column of A
1.2     A(:, i) = A(:, i) - W(:, ii:i-1) V(i, ii:i-1)^T
                          - V(:, ii:i-1) W(i, ii:i-1)^T

      Compute reflector
2.1     [τ, v] = house(A(i+1:n, i))                  v ∈ R^(n-i); τ is a scalar

      Perform matrix-vector multiply
3.3     w = tril(A(i+1:n, i+1:n)) v                  w ∈ R^(n-i)
            + tril(A(i+1:n, i+1:n), -1)^T v

      Update the matrix-vector product
4.1     w = w - W(:, ii:i-1) V(i, i+1:n)^T v
              - V(:, ii:i-1) W(i, i+1:n)^T v

      Compute companion update vector
5.1     c = w · v^T;
        w = τ w - (c τ/2) v
        W(i+1:n, i) = w;
        V(i+1:n, i) = v
  end do i = ii, mxi

      Perform rank 2k update
6.3     A(mxi+1:n, mxi+1:n) = tril(A(mxi+1:n, mxi+1:n)
                                   - W(mxi+1:n, ii:mxi) V(mxi+1:n, ii:mxi)^T
                                   - V(mxi+1:n, ii:mxi) W(mxi+1:n, ii:mxi)^T)
end do ii = 1, n, nb
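
The two facts that blocking relies on can be checked directly: $k$ deferred rank-2 updates equal one rank-$2k$ update (Line 6.3 of Figure 4.3), and any single column of the up-to-date matrix can be recovered from the stale matrix together with the stored $V$ and $W$ (Line 1.2). The sketch below uses full storage and illustrative function names; `V` and `W` are `n` by `k` lists of lists.

```python
def rank2_updates_one_by_one(A, V, W):
    """Apply k rank-2 updates A -= v w^T + w v^T in sequence (BLAS 2 style)."""
    n, k = len(A), len(V[0])
    B = [row[:] for row in A]
    for j in range(k):
        for r in range(n):
            for s in range(n):
                B[r][s] -= V[r][j] * W[s][j] + W[r][j] * V[s][j]
    return B

def rank2k_update(A, V, W):
    """Apply the same updates as one rank-2k update A - V W^T - W V^T (BLAS 3 style)."""
    n, k = len(A), len(V[0])
    return [[A[r][s] - sum(V[r][j] * W[s][j] + W[r][j] * V[s][j] for j in range(k))
             for s in range(n)] for r in range(n)]

def current_column(A, V, W, i):
    """Column i of the up-to-date matrix, from the stale A and the stored V and W."""
    n, k = len(A), len(V[0])
    return [A[r][i] - sum(W[r][j] * V[i][j] + V[r][j] * W[i][j] for j in range(k))
            for r in range(n)]
```

The blocked algorithm calls the equivalent of `current_column` once per column and the equivalent of `rank2k_update` once per block, which is where the BLAS 3 flops come from.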


However, computing $w_1$ requires the computation of $A_1 v$; hence we must either update the
entire matrix $A_1$, returning to an unblocked code, or compute $y = (A_0 - v_0 w_0^T - w_0 v_0^T) v$.
Computing the reflectors and the companion update vectors now requires that the current
column be updated (Line 1.2 in Figure 4.3). The matrix vector product must also be updated
(Line 4.1 in Figure 4.3).

4.2.2 PDSYTRD implementation (Figure 4.4)

Figure 4.5 shows Householder's reduction to tridiagonal form along with a model
for the runtime of each step in ScaLAPACK's reduction to tridiagonal form code, PDSYTRD.
The rest of this section explains the computation and communication pattern in PDSYTRD,
and hence the inefficiencies.

Figure 4.4: PDSYEVX reduction to tridiagonal form (See Figure 4.3 for further details)

do ii = 1, n, nb
  mxi = min(ii + nb, n)
  do i = ii, mxi

      Update current (ith) column of A (Table 4.1)
1.1     spread V(i, ii:i-1)^T and W(i, ii:i-1)^T down
            (The processor owning V(i, ii:i-1) and W(i, ii:i-1) broadcasts to all
            other processors in its processor column.)
1.2     A(:, i) = A(:, i) - W(:, ii:i-1) V(i, ii:i-1)^T
                          - V(:, ii:i-1) W(i, ii:i-1)^T
            (V and W are used as they are stored; no data movement required.)

      Compute reflector (Table 4.2)
2.1     [τ, v] = house(A(i+1:n, i))                  v ∈ R^(n-i); τ is a scalar

      Perform matrix-vector multiply (Table 4.3)
3.1     spread v across
3.2     transpose v, spread down
3.3     w1 = tril(A(i+1:n, i+1:n)) v;
        w2 = tril(A(i+1:n, i+1:n), -1)^T v
            (w1 is distributed like row A(i, :); w2 is distributed like column A(:, i).)
3.4     sum w row-wise
3.5     sum w^T column-wise
3.6     w = w1 + w2
            (w is distributed like column A(:, i), hence w1 must be transposed.)

      Update the matrix-vector product (Table 4.4)
4.1     w = w - W(:, ii:i-1) V(i, i+1:n)^T v
              - V(:, ii:i-1) W(i, i+1:n)^T v

      Compute companion update vector (Table 4.5)
5.1     c = w · v^T;
        w = τ w - (c τ/2) v
        W(i+1:n, i) = w;
        V(i+1:n, i) = v
  end do i = ii, mxi

      Perform rank 2k update (Table 4.6)
6.1     spread V(mxi+1:n, ii:mxi), W(mxi+1:n, ii:mxi) across
            (Processors in the current column of processors broadcast to processors
            in other processor columns.)
6.2     transpose V(mxi+1:n, ii:mxi), W(mxi+1:n, ii:mxi), spread down
6.3     A(mxi+1:n, mxi+1:n) = tril(A(mxi+1:n, mxi+1:n)
                                   - W(mxi+1:n, ii:mxi) V(mxi+1:n, ii:mxi)^T
                                   - V(mxi+1:n, ii:mxi) W(mxi+1:n, ii:mxi)^T)
end do ii = 1, n, nb

Figure 4.5: Execution time model for PDSYEVX reduction to tridiagonal form (See Figure 4.4
for details about the algorithm and indices.)

| Step | Computation (overhead, imbalance) | Communication (latency, bandwidth) |
|---|---|---|
| 1.1 spread $V^T$ and $W^T$ down | | $2 n \lg(\sqrt{p})\,\alpha$ |
| 1.2 $A = A - W V^T - V W^T$ | $2 n \delta_4 + \frac{n^2\, nb}{\sqrt{p}} \gamma_2$ | |
| 2.1 $v = \mathrm{house}(A)$ | $n \delta_4$ | $3 n \lg(\sqrt{p})\,\alpha$ |
| 3.1 spread $v$ across | | $n \lg(\sqrt{p})\,\alpha + \frac{1}{2} \frac{n^2 \lg(\sqrt{p})}{\sqrt{p}} \beta$ |
| 3.2 transpose $v$, spread down | | $n \lg(\sqrt{p})\,\alpha + \frac{1}{2} \frac{n^2 \lg(\sqrt{p})}{\sqrt{p}} \beta$ |
| 3.3 $w = \mathrm{tril}(A) v$; $w^T = \mathrm{tril}(A, -1) v^T$ | $n \delta_4 + \frac{n^2}{nb \sqrt{p}} \delta_2 + \frac{2}{3} \frac{n^3}{p} \gamma_2 + 3 \frac{n^2\, nb}{\sqrt{p}} \gamma_2$ | |
| 3.4 sum $w$ row-wise | | $n \lg(\sqrt{p})\,\alpha + \frac{1}{2} \frac{n^2 \lg(\sqrt{p})}{\sqrt{p}} \beta$ |
| 3.5 sum $w^T$ column-wise | | $n \lg(\sqrt{p})\,\alpha + \frac{1}{2} \frac{n^2 \lg(\sqrt{p})}{\sqrt{p}} \beta$ |
| 3.6 $w = w +$ transpose $w^T$ | | |
| 4.1 $w = w - W V^T v - V W^T v$ | $4 n \delta_4 + 2 \frac{n^2\, nb}{\sqrt{p}} \gamma_2$ | $6 n \lg(\sqrt{p})\,\alpha + \frac{n^2 \lg(\sqrt{p})}{\sqrt{p}} \beta$ |
| 5.1 $c = w \cdot v^T$; $w = \tau w - (c \tau / 2) v$ | $n \delta_4$ | $2 n \lg(\sqrt{p})\,\alpha$ |
| 6.1 spread $V$, $W$ across | | $\frac{n^2 \lg(\sqrt{p})}{\sqrt{p}} \beta$ |
| 6.2 transpose $V$, $W$, spread down | | $\frac{n^2 \lg(\sqrt{p})}{\sqrt{p}} \beta$ |
| 6.3 $A = A - W V^T - V W^T$ | $2 \frac{n}{nb} \delta_3 + \frac{2}{3} \frac{n^3}{p} \gamma_3$ | |

Distribution of data and computation in PDSYTRD

In PDSYEVX, the matrix being reduced, $A$, is distributed across a 2 dimensional grid
of processors. The computation is distributed in a like manner, i.e. computations involving
matrix element $A(i,j)$ are performed by the processor which owns matrix element $A(i,j)$.
Vectors are distributed across the processors within a given column of processors. At the
$i$th step, i.e. when reducing $A(i:n, i:n)$ to $A(i+1:n, i+1:n)$, the vectors are distributed
amongst the processors which own some portion of the vector $A(i:n, i)$. Within calls to
the PBLAS, these vectors are sometimes replicated across all processor columns, or even
transposed and replicated across all processor rows. However, between PBLAS calls, each
vector element is owned by just one processor.
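
The ownership rule of a two-dimensional block cyclic layout can be written in a few lines. This sketch uses 0-based indices and assumes the first block resides on process (0, 0), i.e. zero source offsets; the function names are illustrative and are not ScaLAPACK routines.

```python
def owner(i, j, nb, pr, pc):
    """Process-grid coordinates (row, col) owning global element A(i, j), 0-based."""
    return (i // nb) % pr, (j // nb) % pc

def local_count(n, nb, nprocs, rank, start=0):
    """Number of global indices in range(start, n) owned by `rank` in one dimension."""
    return sum(1 for g in range(start, n) if (g // nb) % nprocs == rank)
```

Since each processor computes on the elements it owns, `local_count` evaluated for the row and column dimensions gives the size of the local piece of the trailing matrix, which is what drives the load imbalance discussed below.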

Critical  path  in  PDSYTRD 

For steps 1.1, 1.2, 2.1, 4.1, 5.1, 6.1, 6.2 and 6.3 in Figure 4.5, i.e. all steps except
"forming the matrix vector product", the processor owning the most rows in the current
column of the remaining matrix has the most work to do and hence is on the critical path.
When the matrix vector product is being formed (steps 3.1 through 3.6), the processor
which owns the most rows and the most columns in the remaining matrix has the most
work (both communication and computation) and hence is on the critical path.

Load  imbalance 

Load imbalance occurs when some processor(s) take longer to perform certain
operations¹, requiring other processors to wait. Each processor is responsible for computations
on the portion of the matrix and/or vectors that it owns. Some processors own a larger
portion of the matrix and/or vectors. Since PDSYTRD has regular synchronization points²,
the processor which takes the longest to complete any given step determines the execution
time for that step.

If row $j$ is the first row in a data layout block, the processor which owns $A(j,j)$ will
own the most rows in $A(j:n, j:n)$:
$\left\lfloor \frac{n-j+1}{nb\, p_r} \right\rfloor nb + \min\left(n - j + 1 - \left\lfloor \frac{n-j+1}{nb\, p_r} \right\rfloor nb\, p_r,\; nb\right)$.
However, if row $j$ is not the first row in a data layout block, even this formula is too simplistic.
Fortunately, $\frac{n-j+1}{p_r} + \frac{nb}{2}$ is an excellent approximation, on average, for the maximum number
of rows of $A(j:n, j:n)$ owned by any processor. $\frac{n-j+1}{p_r} + \frac{nb}{2} \frac{p_r - 1}{p_r}$ is more accurate, but the
difference is too small to be useful.

¹ Load imbalance also occurs during communication, but for PDSYTRD on the machines that we studied
the communication load imbalance was negligible.

² Computing the reflector (Line 2.1) and computing the companion update vector (Line 5.1) require all
the processors in the processor column owning column $i$ of the matrix and are hence synchronization points.
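
The row counts can be compared with the approximation by brute force. In the sketch below `j` is 1-based as in the text, the first block is assumed to start on process row 0, and the function names are illustrative.

```python
def max_rows_owned(n, j, nb, pr):
    """Largest number of rows of A(j:n, j:n) held by one process row (1-based j)."""
    counts = [0] * pr
    for g in range(j - 1, n):             # 0-based global row index
        counts[(g // nb) % pr] += 1
    return max(counts)

def approx_rows(n, j, nb, pr):
    """The approximation (n - j + 1)/pr + nb/2 quoted in the text."""
    return (n - j + 1) / pr + nb / 2

n, nb, pr = 64, 8, 4
worst = [max_rows_owned(n, j, nb, pr) for j in range(1, n + 1)]
```

The exact maximum always lies between the perfectly balanced share `(n - j + 1)/pr` and that share plus one block, which is why an average excess of about `nb/2` is a reasonable model.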

The second source of load imbalance is that many of the computations are
performed only by the processors which own the current column of the matrix.

Updating  the  current  column  of  A 

As shown in Table 4.1, PDSYTRD updates the current column of $A$ through two calls
to PDGEMV, one at line 350 of pdlatrd.f and one at line 355 of pdlatrd.f. Each of these calls
to PDGEMV requires that the first few elements of a column vector ($W$ or $V$) be transposed
and replicated among all the processors in that column. The transposition is fast because
these elements are entirely contained within one processor, but the replication requires a
spread down (column-wise broadcast) of nb or fewer items.
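
The "nb or fewer items" averages out to about nb/2 per column, which is where bandwidth terms of the form $0.5\, n\, nb\, \beta$ come from. The sketch assumes that column $j$ broadcasts $j' = j \bmod nb$ items; the function name is illustrative.

```python
def broadcast_items(n, nb):
    """Total items spread down over all columns when column j sends j mod nb items."""
    return sum(j % nb for j in range(1, n + 1))

n, nb = 64, 8
exact = broadcast_items(n, nb)     # 8 blocks, each contributing 0 + 1 + ... + 7
model = 0.5 * n * nb               # the simplified count
```

The exact count is `n * (nb - 1) / 2` when `n` is a multiple of `nb`, so the simplified model overestimates it by a factor of `nb / (nb - 1)`.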

Standard  data  layout  model 

By making a few assumptions, we can significantly simplify the model. By assuming
that $p_r = p_c = \sqrt{p}$, many of the terms coalesce. We also assume that the panel blocking
factor⁴, pbf, is 2, as it is in ScaLAPACK 1.5.

This standard data layout is also assumed in Figure 4.5 and in Chapter 5. The
models used in Figure 4.5 and in Chapter 5 are subsets, including only the most important
terms, of the "standard data layout" models shown in Tables 4.2 through 4.10.
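
The effect of the standard data layout can be illustrated on the total row of Table 4.1. In this sketch the parameter names follow the table's symbols ($\alpha$ for the latency term, $\beta$ for the bandwidth term, $\delta_2$ and $\delta_4$ for overheads, $\gamma_2$ for BLAS 2 flops); the function names and the machine parameter values are hypothetical.

```python
import math

def column_update_cost(n, nb, pr, alpha, beta, delta2, delta4, gamma2):
    """Total row of Table 4.1 for a general number of process rows pr."""
    lg = math.ceil(math.log2(pr))
    return (2 * n * lg * alpha + n * nb * lg * beta + 2 * n * delta2
            + n * n * nb / pr * gamma2 + 2 * n * delta4)

def column_update_cost_standard(n, nb, p, alpha, beta, delta2, delta4, gamma2):
    """Same total under the standard data layout, pr = pc = sqrt(p)."""
    root = math.isqrt(p)
    if root * root != p:
        raise ValueError("standard layout assumes p is a perfect square")
    return column_update_cost(n, nb, root, alpha, beta, delta2, delta4, gamma2)

# Hypothetical machine parameters, in seconds
t = column_update_cost_standard(4000, 32, 64, alpha=5e-5, beta=1e-7,
                                delta2=2e-6, delta4=2e-5, gamma2=2e-8)
```

Substituting $\sqrt{p}$ for both grid dimensions removes the need to track $p_r$ and $p_c$ separately, at the price of modeling only square process grids.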

Computing  the  reflector  (Line  2.1  in  Figure  4.5) 

PDLARFG computes the reflector as shown in Table 4.2. First, it broadcasts $\alpha = A(j+1, j)$
to all processes that own column $A(:, j)$. Then, it computes the norm $\beta = \lVert A(j+1:n, j) \rVert$,
leaving the result replicated across all processors that own column $A(:, j)$.

The rest of the computation is entirely local and requires only $O(n)$ flops,
hence does not contribute significantly to total execution time.

⁴ The matrix vector multiplies are each performed in panels of size pbf · nb. See Section 4.2.2.

Table 4.1: The cost of updating the current column of A in PDLATRD (Lines 1.1 and 1.2 in
Figure 4.5)

| Task | File:line number or subroutine | Execution time contribution from columns j = 1 to n, shown explicitly | Execution time (simplified) |
|---|---|---|---|
| Broadcast $W(j, 1:j'-1)^T$ within current column.³ | pdlatrd.f:350, pdgemv_.c, pbdgemv.f:560, dgebs2d | $\sum_{j=1}^{n} \left( \lceil \log_2 p_r \rceil \alpha + \delta_4 + j' \lceil \log_2 p_r \rceil \beta \right)$ | $n \lceil \log_2 p_r \rceil \alpha + n \delta_4 + 0.5\, n\, nb \lceil \log_2 p_r \rceil \beta$ |
| Compute local portion of $A(j:n, j) = A(j:n, j) - V(j:n, 1:j'-1) \times W(j, 1:j'-1)^T$ | pdlatrd.f:350, pdgemv_.c, pbdgemv.f:580, dgemv | $\sum_{j=1}^{n} \left( \delta_2 + 2 \frac{(n-j)\, j'}{p_r} \gamma_2 \right)$ | $n \delta_2 + 0.5 \frac{n^2\, nb}{p_r} \gamma_2$ |
| Broadcast $V(j, 1:j'-1)^T$ within current column. | pdlatrd.f:355, pdgemv_.c, pbdgemv.f:560 | $\sum_{j=1}^{n} \left( \lceil \log_2 p_r \rceil \alpha + \delta_4 + j' \lceil \log_2 p_r \rceil \beta \right)$ | $n \lceil \log_2 p_r \rceil \alpha + n \delta_4 + 0.5\, n\, nb \lceil \log_2 p_r \rceil \beta$ |
| Compute local portion of $A(j:n, j) = A(j:n, j) - W(j:n, 1:j'-1) \times V(j, 1:j'-1)^T$ | pdlatrd.f:355, pdgemv_.c, pbdgemv.f:580, dgemv | $\sum_{j=1}^{n} \left( \delta_2 + 2 \frac{(n-j)\, j'}{p_r} \gamma_2 \right)$ | $n \delta_2 + 0.5 \frac{n^2\, nb}{p_r} \gamma_2$ |
| Total | | | $2 n \lceil \log_2 p_r \rceil \alpha + n\, nb \lceil \log_2 p_r \rceil \beta + 2 n \delta_2 + \frac{n^2\, nb}{p_r} \gamma_2 + 2 n \delta_4$ |
| Standard data layout (See section 4.2.2) | | | $2 n \lceil \log_2 \sqrt{p} \rceil \alpha + n\, nb \lceil \log_2 \sqrt{p} \rceil \beta + 2 n \delta_2 + \frac{n^2\, nb}{\sqrt{p}} \gamma_2 + 2 n \delta_4$ |

Table 4.2: The cost of computing the reflector (PDLARFG) (Line 2.1 in Figure 4.5)

| Task | File:line number or subroutine | Execution time contribution from columns j = 1 to n, shown explicitly | Execution time (simplified) |
|---|---|---|---|
| $\alpha = A(j+1, j)$ | pdlatrd.f:364, pdlarfg.f:213, dgebs2d | $\sum_{j=1}^{n} \lceil \log_2 p_r \rceil \alpha$ | $n \lceil \log_2 p_r \rceil \alpha$ |
| $\mathrm{xnorm} = \lVert A(j+1:n, j) \rVert$ | pdlatrd.f:364, pdlarfg.f:229, pdnrm2 | $\sum_{j=1}^{n} \left( 2 \lceil \log_2 p_r \rceil \alpha + \delta_1 \right)$ | $2 n \lceil \log_2 p_r \rceil \alpha + n \delta_1$ |
| $\tau = -(\alpha + \beta)/\beta$ | pdlatrd.f:364, pdlarfg.f:271 | negligible | negligible |
| $A(j+2:n, j) = A(j+2:n, j)/(\alpha + \beta)$ | pdlatrd.f:364, pdlarfg.f:272, pdscal | $\sum_{j=1}^{n} \delta_4$ | $n \delta_4$ |
| $E(j) = A(j+1, j) = \beta$ | pdlatrd.f:364, pdlarfg.f:273 | negligible | negligible |
| Total | | | $3 n \lceil \log_2 p_r \rceil \alpha + n \delta_4$ |
| Standard data layout (See section 4.2.2) | | | $3 n \lceil \log_2 \sqrt{p} \rceil \alpha + n \delta_4$ |

Forming the matrix vector product using PDSYMV (Lines 3.1 through 3.6 in Figure 4.5)

The matrix $A$ is laid out in a block cyclic manner as described in section 2.5.4.
Computing the matrix vector product $y = Av$ requires that $v$ be copied to all processes
that own a part of $A$ that needs to be multiplied by $v$. The vector $v$ must be transposed⁵
and spread down, and because only half of $A$ is stored, $v$ must also be spread across.
Each processor in the processor column that owns $v$ sends its elements of $v$ to the
processors in its processor row. Then, the matrix vector multiplies⁶, $w_1 = \mathrm{tril}(A, 0)\, v$ and
$w_2 = \mathrm{tril}(A, -1)^T v$, are performed locally. $w_2$ is summed within columns, transposed and
added to the result of $\mathrm{tril}(A, 0)\, v$, which is summed to the active column of processors.
The algorithm used by PDSYMV is:

Algorithm 4.1 PDSYMV as used to compute $Av$

1  Broadcast $v$ within each row of processors (Line 3.1 in Figure 4.4)
2  Transpose $v$ within each column of processors (Line 3.2 in Figure 4.4)
3  Broadcast $v^T$ within each column of processors (Line 3.2 in Figure 4.4)
4  Form diagonal portion of $A$ (Line 3.3 in Figure 4.4)
5  $w_1$ = locally available portion of $\mathrm{tril}(A, 0)\, v$ (Line 3.3 in Figure 4.4)
6  $w_2 = v^T \mathrm{tril}(A, -1)$ (Line 3.3 in Figure 4.4)
7  Sum $w_1$ within each column of processors (Line 3.4 in Figure 4.4)
8  Sum $w_2$ within each row of processors (Line 3.5 in Figure 4.4)
9  Transpose $w_2$ and add to $w_1$ (Line 3.6 in Figure 4.4)
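
Setting the communication aside, the arithmetic of Algorithm 4.1 amounts to forming $Av$ from the lower triangle alone. The serial sketch below shows why two local products are needed: each stored off-diagonal element is used once directly and once transposed. The function name is illustrative and the entries above the diagonal are deliberately never read.

```python
def symv_lower(A, v):
    """Compute A v for symmetric A using only entries on or below the diagonal."""
    n = len(v)
    # w1 = tril(A, 0) v : each row uses its stored entries up to the diagonal
    w1 = [sum(A[i][j] * v[j] for j in range(i + 1)) for i in range(n)]
    # w2 = tril(A, -1)^T v : each column uses its entries strictly below the diagonal
    w2 = [sum(A[i][j] * v[i] for i in range(j + 1, n)) for j in range(n)]
    return [a + b for a, b in zip(w1, w2)]

# Entries above the diagonal are junk (99) to show that they are never referenced.
A = [[2, 99, 99],
     [1, 3, 99],
     [4, 5, 6]]
result = symv_lower(A, [1, 2, 3])
```

In the parallel code `w1` and `w2` end up distributed differently (one like a column, one like a row), which is what forces the transpose in the final step.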

The two transpose operations, steps {2,3} and step 9 in Algorithm 4.1, though both
are performed by PBDTRNV, use different communication patterns. The transpose performed
in steps 2 and 3 is an all-to-all. It takes $v$ replicated across the processor columns and
distributed across the processor rows and produces $v^T$ replicated across the processor rows
and distributed across the processor columns. The transpose performed in step 9 is a
one-to-one transpose. It takes $y^T$ distributed across the processor columns within one processor
row. It produces $y$ distributed across the processor rows within the current processor
column.

⁵ The non-transposed $v$ is distributed like column $A(:, i)$; the transposed $v$ is distributed like row $A(i, :)$.

⁶ tril() is MATLAB notation for the lower triangular portion of a matrix (including the diagonal). tril( , -1)
refers to the portion of the matrix below the diagonal.

The all-to-all transposition is performed in two steps (steps 2 and 3 in
Algorithm 4.1). Since each column of processors contains a complete copy of the vector $v$,
each acts independently, first collecting the portion of $v^T$ that belongs to this processor
column to one processor⁷ and then broadcasting it to all processor columns. The operation of
collecting the portion of $v^T$ that belongs to this processor column to one processor is done as
a tree-based reduction, requiring $\lceil \log_2(\mathrm{lcm}(p_r, p_c)) \rceil$ messages and a total number of
words which I model as $\frac{n-j}{p_c}$ words. The broadcast which completes the transpose (step 3)
requires $\lceil \log_2(p_c) \rceil$ messages and $\lceil \log_2(p_c) \rceil \frac{n-j}{p_c}$ words.

The one-to-one transpose (step 9) is accomplished as a single set of direct messages.
Every word in $y^T$ is owned by exactly one processor. Every word in $y$ should be sent to one
processor. Every word in $y^T$ is sent from the processor that owns it to the processor that
needs the corresponding word in $y$. All words being sent between the same two processors
are sent in a single message. Each processor that owns a part of $y^T$ sends every word that
it owns, i.e. $\frac{n-j}{p_c}$ words in $\mathrm{lcm}(p_r, p_c)/p_c$ messages. Every processor that
needs a part of $y$ receives the number of words that it needs: $\frac{n-j}{p_r}$ words in
$\mathrm{lcm}(p_r, p_c)/p_r$ messages.

The two matrix vector multiplies are each performed in panels of size pbf · nb.
pbf, the panel blocking factor, is set to $\max(\mathrm{mullen}, \mathrm{lcm}(p_r, p_c)/p_c)$, where mullen is a tuning
parameter set at compile-time to 2 in ScaLAPACK 1.5.
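
The rule for the panel blocking factor is simple enough to state as code. The function name is illustrative; `mullen` defaults to 2 as in the release described above.

```python
import math

def panel_blocking_factor(pr, pc, mullen=2):
    """pbf = max(mullen, lcm(pr, pc) / pc)."""
    lcm = pr * pc // math.gcd(pr, pc)
    return max(mullen, lcm // pc)

square = panel_blocking_factor(4, 4)      # lcm/pc = 1, so mullen wins
tall = panel_blocking_factor(8, 2)        # lcm/pc = 4
```

On a square grid the factor is always `mullen`; it grows only when the number of process rows is not a divisor of the number of process columns.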

The  cost  of  the  matrix  vector  multiply  is  detailed  in  table  4.3. 

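The number of flops in the matrix vector multiply which any given processor must
perform is controlled by the size and shape of the local portion of the trailing matrix.
The processor holding the largest portion of the trailing matrix holds a matrix of size
approximately⁸ $\left\lceil \frac{n-j}{nb\, p_r} \right\rceil nb \times \left\lceil \frac{n-j}{nb\, p_c} \right\rceil nb$. Because we update only the lower triangular portion
of the matrix, each element in the lower triangular portion of the matrix is used in two matrix
vector multiplies. And, because the shape of the local portion of the matrix is irregular
(a column block stair step with some diagonal steps), the matrix vector computation is
performed by column blocks. The irregular pattern repeats every $\mathrm{lcm}(p_r, p_c)\, nb$, so pbf, the
panel blocking factor, is chosen to be $\max(\mathrm{mullen}, \mathrm{lcm}(p_r, p_c)/p_c)$, where mullen is a compile time
parameter, set to 2 in the standard PBLAS release. The column panels are filled out with
zeroes to make the matrix vector multiply efficient. Even the act of filling the diagonal
blocks with zeroes, because it is done inefficiently, is noticeable on modest problem sizes.

⁷ If $p_r = \mathrm{lcm}(p_c, p_r)$, the portion of data that belongs to this processor column is already on one processor
and hence this "collection" is a null operation.

⁸ The largest local matrix size differs from this only when $\mathrm{mod}(n - j, nb\, p_r) < nb$ or $\mathrm{mod}(n - j, nb\, p_c) < nb$.

Table 4.3: The cost of all calls to PDSYMV from PDSYTRD

| Task | File:line number or subroutine | Execution time contribution from columns j = 1 to n, shown explicitly | Execution time (simplified) |
|---|---|---|---|
| Broadcast $v$ within each processor row (Line 3.1) | pdlatrd.f:370, pdsymv_.c, pbdsymv.f:40?, dgebr2d | $\sum_{j=1}^{n} \left( \lceil \log_2 p_c \rceil \alpha + \lceil \log_2 p_c \rceil \left( \frac{n-j}{p_r} + \frac{nb}{2} \right) \beta \right)$ | $n \lceil \log_2 p_c \rceil \alpha + 0.5 \frac{n^2 \lceil \log_2 p_c \rceil}{p_r} \beta + 0.5\, n\, nb \lceil \log_2 p_c \rceil \beta$ |
| Transpose $v$ (Line 3.2) | pdlatrd.f:370, pdsymv_.c, pbdsymv.f:421, pbdtrnv.f:385, pbdtrget | $\sum_{j=1}^{n} \left( \lceil \log_2(\mathrm{lcm}(p_r, p_c)) \rceil \alpha + \frac{n-j}{p_c} \beta \right)$ | $n \lceil \log_2(\mathrm{lcm}(p_r, p_c)) \rceil \alpha + 0.5 \frac{n^2}{p_c} \beta$ |
| Broadcast $v^T$ down within each processor column (Line 3.2) | pdlatrd.f:370, pdsymv_.c, pbdsymv.f:421, pbdtrnv.f:400, dgebs2d | $\sum_{j=1}^{n} \left( \lceil \log_2 p_r \rceil \alpha + \lceil \log_2 p_r \rceil \left( \frac{n-j}{p_c} + \frac{nb}{2} \right) \beta \right)$ | $n \lceil \log_2 p_r \rceil \alpha + 0.5 \frac{n^2 \lceil \log_2 p_r \rceil}{p_c} \beta + 0.5\, n\, nb \lceil \log_2 p_r \rceil \beta$ |
| Form diagonal portion of matrix, padded with zeroes (Line 3.3) | pdlatrd.f:370, pdsymv_.c, pbdsymv.f:685, pbdlacpl | $\sum_{j=1}^{n} \left( \frac{n-j}{nb\, p_c} \delta_1 + \frac{n-j}{nb\, p_c} \delta_1 \right)$ | $\frac{1}{2} \frac{n^2}{nb\, p_c} \delta_1 + \frac{1}{2} \frac{n^2}{nb\, p_c} \delta_1$ |
| $w = \mathrm{tril}(A, 0)\, v$, $w^T = v^T \mathrm{tril}(A, -1)$ local computation (Line 3.3) | pdlatrd.f:370, pdsymv_.c, pbdsymv.f:702, 704, 757, 759, dgemv | $2 \sum_{j=1}^{n} \left( 2 \frac{n-j}{pbf\, nb\, p_c} \delta_2 + \left( \left\lceil \frac{n-j}{nb\, p_r} \right\rceil nb \right) \left( \left\lceil \frac{n-j}{nb\, p_c} \right\rceil nb \right) \gamma_2 + (n-j) \frac{pbf\, nb}{p_r} \gamma_2 \right)$ | $2 \frac{n^2}{pbf\, nb\, p_c} \delta_2 + \frac{2}{3} \frac{n^3}{p} \gamma_2 + \frac{1}{2} \frac{n^2\, nb}{p_r} \gamma_2 + \frac{1}{2} \frac{n^2\, nb}{p_c} \gamma_2 + \frac{n^2\, nb\, pbf}{p_r} \gamma_2$ |
| Sum, row-wise, $w$ (Line 3.4) | pdlatrd.f:364, pdsymv_.c, pbdsymv.f:801, dgsum2d | $\sum_{j=1}^{n} \left( \lceil \log_2 p_c \rceil \alpha + \lceil \log_2 p_c \rceil \left( \frac{n-j}{p_r} + \frac{nb}{2} \right) \beta \right)$ | $n \lceil \log_2 p_c \rceil \alpha + 0.5 \frac{n^2 \lceil \log_2 p_c \rceil}{p_r} \beta + 0.5\, n\, nb \lceil \log_2 p_c \rceil \beta$ |
| Sum, column-wise, $w^T$ (Line 3.5) | pdlatrd.f:364, pdsymv_.c, pbdsymv.f:809, dgsum2d | $\sum_{j=1}^{n} \left( \lceil \log_2 p_r \rceil \alpha + \lceil \log_2 p_r \rceil \left( \frac{n-j}{p_c} + \frac{nb}{2} \right) \beta \right)$ | $n \lceil \log_2 p_r \rceil \alpha + 0.5 \frac{n^2 \lceil \log_2 p_r \rceil}{p_c} \beta + 0.5\, n\, nb \lceil \log_2 p_r \rceil \beta$ |
| Transpose $w^T$ and sum into $w$ (Line 3.6) | pdlatrd.f:370, pdsymv_.c, pbdsymv.f:811, pbdtrnv | $\sum_{j=1}^{n} \left( \frac{\mathrm{lcm}(p_r, p_c)}{p_r} \alpha + \frac{\mathrm{lcm}(p_r, p_c)}{p_c} \alpha + \frac{n-j}{p_c} \beta + \frac{n-j}{p_r} \beta + \frac{n-j}{nb\, p_c} \delta_1 + \frac{n-j}{nb\, p_r} \delta_1 \right)$ | $n \frac{\mathrm{lcm}(p_r, p_c)}{p_r} \alpha + n \frac{\mathrm{lcm}(p_r, p_c)}{p_c} \alpha + 0.5 \frac{n^2}{p_c} \beta + 0.5 \frac{n^2}{p_r} \beta + 0.5 \frac{n^2}{nb\, p_c} \delta_1 + 0.5 \frac{n^2}{nb\, p_r} \delta_1$ |
| Total | | | $2 n \lceil \log_2 p_c \rceil \alpha + 2 n \lceil \log_2 p_r \rceil \alpha + n \frac{\mathrm{lcm}(p_r, p_c)}{p_r} \alpha + n \frac{\mathrm{lcm}(p_r, p_c)}{p_c} \alpha + n \lceil \log_2(\mathrm{lcm}(p_r, p_c)) \rceil \alpha + \frac{n^2 \lceil \log_2 p_c \rceil}{p_r} \beta + \frac{n^2 \lceil \log_2 p_r \rceil}{p_c} \beta + \frac{n^2}{p_c} \beta + 0.5 \frac{n^2}{p_r} \beta + n\, nb \lceil \log_2 p_c \rceil \beta + n\, nb \lceil \log_2 p_r \rceil \beta + 2 \frac{n^2}{pbf\, nb\, p_c} \delta_2 + \frac{2}{3} \frac{n^3}{p} \gamma_2 + \frac{1}{2} \frac{n^2\, nb}{p_r} \gamma_2 + \frac{1}{2} \frac{n^2\, nb}{p_c} \gamma_2 + \frac{n^2\, nb\, pbf}{p_r} \gamma_2 + 1.5 \frac{n^2}{nb\, p_c} \delta_1 + 0.5 \frac{n^2}{nb\, p_r} \delta_1 + n \delta_4$ |
| Standard data layout (See section 4.2.2) | | | $4 n \lceil \log_2 \sqrt{p} \rceil \alpha + 2 n \alpha + 2 \frac{n^2 \lceil \log_2 \sqrt{p} \rceil}{\sqrt{p}} \beta + 1.5 \frac{n^2}{\sqrt{p}} \beta + 2\, n\, nb \lceil \log_2 \sqrt{p} \rceil \beta + \frac{n^2}{nb \sqrt{p}} \delta_2 + \frac{2}{3} \frac{n^3}{p} \gamma_2 + 3 \frac{n^2\, nb}{\sqrt{p}} \gamma_2 + 2 \frac{n^2}{nb \sqrt{p}} \delta_1 + n \delta_4$ |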

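The number of flops required for a global $(n - j) \times (n - j)$ matrix vector multiply
is approximately:

$$2 \times 2 \times \left( \frac{1}{2} \left( \left\lceil \frac{n-j}{nb\, p_r} \right\rceil nb \right) \left( \left\lceil \frac{n-j}{nb\, p_c} \right\rceil nb \right) + (n - j) \frac{pbf\, nb}{2 p_r} \right)$$

The first 2 is because multiplies and adds are counted separately. Each element in the lower
triangular portion of the matrix is involved twice, hence the second 2. The first term stems
directly from the size of the local matrix. The second term stems from the odd shape of
the local matrix and is primarily the result of the unnecessary flops (zero matrix elements)
added to reduce the number of dgemv calls.

We use the following equality, dropping the $O(n)$ term:

$$\sum_{i=1}^{n} \left\lceil \frac{i}{a} \right\rceil \left\lceil \frac{i}{b} \right\rceil = \frac{n^3}{3ab} + \frac{n^2}{4a} + \frac{n^2}{4b} + O(n).$$

$$\begin{aligned}
\text{flops} &= \sum_{j=1}^{n} 2 \times 2 \left( \frac{1}{2} \left\lceil \frac{n-j}{nb\, p_r} \right\rceil nb \left\lceil \frac{n-j}{nb\, p_c} \right\rceil nb + (n - j) \frac{pbf\, nb}{2 p_r} \right) \\
&= 2 \times 2 \left( \frac{1}{2}\, nb^2 \left( \frac{1}{3} \frac{n^3}{nb^2\, p_r\, p_c} + \frac{1}{4} \frac{n^2}{nb\, p_r} + \frac{1}{4} \frac{n^2}{nb\, p_c} \right) + \frac{n^2}{2} \frac{pbf\, nb}{2 p_r} \right) \\
&= \frac{2}{3} \frac{n^3}{p_r\, p_c} + \frac{1}{2} \frac{n^2\, nb}{p_r} + \frac{1}{2} \frac{n^2\, nb}{p_c} + \frac{n^2\, nb\, pbf}{p_r}
\end{aligned}$$

Figure 4.6: Flops in the critical path during the matrix vector multiply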
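
The closed form for a sum of products of ceilings, as used in the derivation of Figure 4.6, can be checked numerically. The symbols `a` and `b` (standing for $nb\, p_r$ and $nb\, p_c$) and the function names are assumptions made for this sketch.

```python
def ceil_div(x, y):
    """Integer ceiling of x / y for positive integers."""
    return -(-x // y)

def exact_sum(n, a, b):
    """Sum over i = 1..n of ceil(i/a) * ceil(i/b)."""
    return sum(ceil_div(i, a) * ceil_div(i, b) for i in range(1, n + 1))

def closed_form(n, a, b):
    """n^3/(3ab) + n^2/(4a) + n^2/(4b), i.e. the sum with its O(n) term dropped."""
    return n**3 / (3 * a * b) + n**2 / (4 * a) + n**2 / (4 * b)

n, a, b = 1200, 4, 6
relative_error = abs(exact_sum(n, a, b) / closed_form(n, a, b) - 1.0)
```

For `a = b = 1` the exact sum is the sum of squares, and the closed form differs from it by exactly `n/6`, which is the dropped $O(n)$ term.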


Updating  the  matrix  vector  product 

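Updating the matrix vector product, $y = y - V W^T v - W V^T v$, requires four matrix vector
multiplies. $\mathrm{temp} = W^T v$ and $\mathrm{temp} = V^T v$ are both $(n - j) \times j'$ by $n - j$ matrix vector
multiplies, where $j' = j \bmod nb$. Both the matrix and the vector are stored in the current
process column. No data movement is required to perform the computation; however, the
result, a vector of length $j' - 1$, is the sum of the matrix vector multiplies performed on each
of the processes in the process column.

Table 4.4: The cost of updating the matrix vector product in PDLATRD (Line 4.1 in Figure 4.5)

| Task | File:line number or subroutine | Execution time contribution from columns j = 1 to n, shown explicitly | Execution time (simplified) |
|---|---|---|---|
| Broadcast $W^T$ unnecessarily for $\mathrm{temp} = W^T v$ | pdlatrd.f:373, pbdgemv.f:826, dgebs2d | $\sum_{j=1}^{n} \left( \delta_4 + \lceil \log_2 p_c \rceil \alpha + \lceil \log_2 p_c \rceil \frac{n-j}{p_r} \beta + \lceil \log_2 p_c \rceil \frac{nb}{2} \beta \right)$ | $n \delta_4 + n \lceil \log_2 p_c \rceil \alpha + 0.5 \frac{n^2 \lceil \log_2 p_c \rceil}{p_r} \beta + 0.5\, n\, nb \lceil \log_2 p_c \rceil \beta$ |
| Local computation of $\mathrm{temp} = W^T v$ | pdlatrd.f:373, pbdgemv.f:846, dgemv | $\sum_{j=1}^{n} \left( \delta_2 + 2 \frac{(n-j)\, j'}{p_r} \gamma_2 \right)$ | $n \delta_2 + 0.5 \frac{n^2\, nb}{p_r} \gamma_2$ |
| Sum the contribution of temp from all processes in the column | pdlatrd.f:373, pbdgemv.f:858, dgsum2d | $\sum_{j=1}^{n} \left( \lceil \log_2 p_r \rceil \alpha + \lceil \log_2 p_r \rceil j' \beta \right)$ | $n \lceil \log_2 p_r \rceil \alpha + 0.5\, n\, nb \lceil \log_2 p_r \rceil \beta$ |
| Broadcast temp (row-wise) to all processes in this column | pdlatrd.f:376, pbdgemv.f:579, dgebs2d | $\sum_{j=1}^{n} \left( \delta_4 + \lceil \log_2 p_r \rceil \alpha + \lceil \log_2 p_r \rceil j' \beta \right)$ | $n \delta_4 + n \lceil \log_2 p_r \rceil \alpha + 0.5\, n\, nb \lceil \log_2 p_r \rceil \beta$ |
| Local computation of $y = V \cdot \mathrm{temp}$ | pdlatrd.f:376, pbdgemv.f:600, dgemv | $\sum_{j=1}^{n} \left( \delta_2 + 2 \frac{(n-j)\, j'}{p_r} \gamma_2 \right)$ | $n \delta_2 + 0.5 \frac{n^2\, nb}{p_r} \gamma_2$ |
| $y = y + W V^T$ is identical to $y = y + V W^T$ | pdlatrd.f:379, pdlatrd.f:382 | | $n \lceil \log_2 p_c \rceil \alpha + 2 n \lceil \log_2 p_r \rceil \alpha + 0.5 \frac{n^2 \lceil \log_2 p_c \rceil}{p_r} \beta + 0.5\, n\, nb \lceil \log_2 p_c \rceil \beta + n\, nb \lceil \log_2 p_r \rceil \beta + 2 n \delta_2 + \frac{n^2\, nb}{p_r} \gamma_2 + 2 n \delta_4$ |
| Total | | | $2 n \lceil \log_2 p_c \rceil \alpha + 4 n \lceil \log_2 p_r \rceil \alpha + \frac{n^2 \lceil \log_2 p_c \rceil}{p_r} \beta + n\, nb \lceil \log_2 p_c \rceil \beta + 2\, n\, nb \lceil \log_2 p_r \rceil \beta + 4 n \delta_2 + 2 \frac{n^2\, nb}{p_r} \gamma_2 + 4 n \delta_4$ |
| Standard data layout (See section 4.2.2) | | | $6 n \lceil \log_2 \sqrt{p} \rceil \alpha + \frac{n^2 \lceil \log_2 \sqrt{p} \rceil}{\sqrt{p}} \beta + 3\, n\, nb \lceil \log_2 \sqrt{p} \rceil \beta + 4 n \delta_2 + 2 \frac{n^2\, nb}{\sqrt{p}} \gamma_2 + 4 n \delta_4$ |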

The other two matrix vector multiplies, $y = V \cdot temp$ and $y = W \cdot temp$, are both $(n-j) \times (j'-1)$ by $j'-1$ matrix vector multiplies. Again, the computation is performed entirely within the current process column. The 1 by $j'-1$ vector, temp, must be spread down, i.e. broadcast column-wise, to all processes in this process column; however, no further communication is necessary in order to update $y$, as $y$ is perfectly aligned with $V$.

Details are given in Table 4.4.

Computing  the  companion  update  vector 

The details involved in computing the companion update vector are shown in Table 4.5.

Table 4.5: The cost of computing the companion update vector in PDLATRD (Line 5.1 in Figure 4.5)

Task | File:line number or subroutine | Execution time contribution from columns j = 1 to n shown explicitly | Execution time (simplified)

Compute  y  =  t  y 

pdlatrd.f  :385 
pdscal 

n 

E  5*4 

8  =  1 

^  Tl  64 

Compute  a  =  —0.5 t  yT v 

pdlatrd.f  :386 
pddot 

n 

E  (riog2(jv)i«'+§54) 

8  =  1 

n  [t°S2  (Pr  )1  o  +  |-ra54 

Compute  w  =  y  —  av 

pdlatrd.f  :390 
pdaxpy 

n 

E  (fl°S2(Pr)la+ifi4) 

8  =  1 

n  0°g2(lJr)l  Q'+|n  *4 

Total 

2  n  [dog2(pr  )~|  O'  +  n  64 

Standard  data  layout 
(See  section  4.2.2) 

2n  flog2(  v/p)]  ot  +  nS4 

Performing  the  rank  2k  update 

The rank-2k update is performed once per block column (i.e. $n/nb$ times):

$$A = A - V W^T - W V^T .$$

PDSYTRD broadcasts $V$ and $W$ along processor rows, transposes them and then broadcasts them along processor columns. I ignore the $\alpha$ (latency) cost of the transpose here, because it is less significant (by a factor of $nb$) than the similar cost for the transpose in the matrix-vector multiply and because it is only relevant when $\mathrm{lcm}(p_r, p_c)$ is very large. The third $\beta$ term in the transpose and broadcast operation should be multiplied by $\mathrm{lcm}(p_r, p_c)/p_c$, but the added complexity is not justified for a small term.

The  number  of  flops  performed  during  the  rank  two  update  of  A(j  :  n,j  :  n)  is 
modeled  as: 


,  1  n  —  j  nb  n  —  j 

2  x  2  x  nb  ( — ( - -  H - )( - - 

2  pr  ^  2  pc 


nb 

Y 


n nb  pbf 

2  pr 


The  number  of  flops  performed  per  matrix  element  involved  in  the  rank-2  update  is  2  X  2  X  nb . 
The  number  of  elements  in  the  lower  triangular  matrix  is  given  by  the  sum  of  the  terms 
within  the  parentheses. 

The total number of flops for all rank two updates is modeled as the sum of this quantity as $j$ ranges from $nb$ to $n$ by $nb$.

Table 4.6: The cost of performing the rank-2k update (PDSYR2K) (Lines 6.1 through 6.3 in Figure 4.5)

Task | File:line number or subroutine | Execution time contribution from columns j = 1 to n shown explicitly | Execution time (simplified)

Broadcast  V  and  W 
within  process  rows 
(Line  6.1) 

pdsytrd.f  :354 
pdsyr2k_.c 
pdsyr2k.f  :454,477 
dgebs2d 

n 

E  (2  riog2(iv)i«+ 

j=nb,nb 

2(^  +  f)\loS2(Pc)]l3) 

2 77  d°S2(Pc)l  a 
+  £  flog2(Pc)14 

-n  nb |"log2  (pc)l 4 

Transpose  and 
broadcast  V  and  W 
within  process  columns 
(Line  6.2) 

pdsytrd.f  :354 
pdsyr2k_.c 
pdsyr2k.f  :491,847 
pbdtran 

n 

E  (  2  flog2  (  j- )  i  Q  -f- 

j=nb,nb 

2(^  +  ^)nbflog2(Pr)14 
+  ^3) 

2779!loS2(lu)l  a 
+  fy  [log2(Pr)14 

-n  nb  |"log2  (pr )]  4 

+  — 4 

tril(  A,  0)  =  t ril ( A,  0) 

+v  ■  WT  +  w  ■  VT 

(Line  6.3) 

pdsytrd.f  :354 
pdsyr2k_.c 
pdsyr2k.f :655— 60  , 
1052-57 
pdgemm 

n 

E  (4„bPcpbf*3+4nb 

j=nb,nb 

+  ^)73,) 

2  ,  2  hf  &3  + 

nb^  pc  pbt 

2  1  n2  nb  . 

3^  +  2^“^  + 

1  r?  nb  ^  i  r?  nb  pbf  ^ 

2  Pc  '*>  +  — 5T-73 

Total 

2t¥  rio§2(pc)i  q+ 2^  riog2(prn  Q+^  fiog2(pc)i/?+2i  rioS2(pr)i  4+^/3 
nnb  (log2  (pc  )1 4  n  nb  [log2  (pr  )1 4+2  nb2  ^  pbf  S3  +  ”  73  +  2  "prnb73 
+  1^73+  2^73 

Standard  data  layout 
(See  section  4.2.2) 

477  riog2(y?)l  «+2^  riog2(y?)14+^4 

2nnb  [l°g2( \/p)14+ nb2  ^<$3+3  73 +3  ^  73 

The negative term ($-2 \frac{n^2\, nb}{p} \gamma_3$), which results from the fact that $j$ starts at $nb$, is ignored because it is $O(\frac{n^2\, nb}{p})$ and hence too small.

Details are given in Table 4.6.


4.2.3  PDSYTRD  execution  time  summary 

Table  4.7  shows  that  the  computation  cost  in  PDSYTRD  is: 
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2  ??d 
3  p 


73 


2  ??d 
3  p 


72 


??2  nb  pbf  7  n2  nb  1  ??2  nb 

- 72  +  x  - 72  +  x  - 72  1 - 

Pr  2  pr  2  pc  p, 

n2  nb  pbf  1  ??2  nb  1  ??2  nb 


n2  nb  pbf 


■72 


Pr 


■  73  +  X  - 

2  pr 


■  73  +  X  - 

2  pc 


n 


nb 2  pc  pbf 


S3  +  6  n  ft  2  +  2 


■73 


- rr-r  t-2  H - <%  +  9  n  ft4  . 

pr  pbf  nb  pc 


The most important terms in the computation cost are the $O(\frac{n^3}{p})$ flops. The relative importance of the other ($o(n^3)$) terms depends on the computer. On the PARAGON none stand out above the rest. Indeed, on the PARAGON none of the $o(n^3)$ terms accounts for more than 3% of the total execution time of PDSYEVX when $n = 3480$ and $p = 64$. However, all of the $o(n^3)$ terms combined account for 21% of the total execution time on that same problem.

Table 4.8 shows that the computation cost in the tridiagonal eigendecomposition in PDSYEVX is:

$$53 \frac{n e}{p} \gamma_{\div} + 3 \frac{n m}{p} \gamma_{\div} + 112\, n\, \gamma_{\div} + 265 \frac{n e}{p} \gamma_1 + 45 \frac{n m}{p} \gamma_1 + 620\, n\, \gamma_1 + 6\, n c^2 \gamma_1 .$$

The execution time of tridiagonal eigendecomposition is dominated by the cost of divides and by the size of the largest cluster, $c$. The load imbalance terms ($112\, n\, \gamma_{\div}$ and $620\, n\, \gamma_1$) are negligible.

Table 4.9 shows that the communication cost in PDSYTRD is:

$$4 n \lceil \log_2(p_c) \rceil \alpha + 13 n \lceil \log_2(p_r) \rceil \alpha + n \frac{\mathrm{lcm}(p_r,p_c)}{p_r} \alpha + n \lceil \log_2(\mathrm{lcm}(p_r,p_c)) \rceil \alpha + 3 \frac{n^2}{p_r} \lceil \log_2(p_c) \rceil \beta + 2 \frac{n^2}{p_c} \lceil \log_2(p_r) \rceil \beta + \frac{1}{2} \frac{n^2}{p_r} \beta + 2 \frac{n^2}{p_c} \beta .$$

Most of the messages are in broadcasts and reductions (i.e. the $O(n \log(p))$ terms) and most of the broadcasts and reductions ($13n$) are within processor rows, versus only $4n$ broadcasts and reductions within processor columns. By contrast, the message volume is fairly evenly split between broadcasts and reductions within processor rows ($3 \frac{n^2}{p_r} \lceil \log_2(p_c) \rceil \beta$) and broadcasts and reductions within processor columns ($2 \frac{n^2}{p_c} \lceil \log_2(p_r) \rceil \beta$).

The lcm terms are negligible unless $p$ is very large, in which case it is important to make sure that $\mathrm{lcm}(p_r,p_c)$ is reasonable (say $< 10 \max(p_r,p_c)$).


4.3  Eigendecomposition  of  the  tridiagonal 

The execution time of tridiagonal eigendecomposition is dominated by two factors: the size of the largest cluster of eigenvalues and the speed of the divide.

4.3.1  Bisection 

During bisection, in DSTEBZ, each Sturm count requires $n$ divisions and $5n$ other flops to produce one additional bit of accuracy. Hence, it takes roughly $53n$ divisions and $53 \times 5n$ flops⁹ for each eigenvalue and $53\,n\,e$ total divisions for all eigenvalues in IEEE double precision, where $e$ is the number of eigenvalues to be computed. The exact number of divisions and flops depends on the actual eigenvalues, the parallelization strategy and other factors. However, this simple model suffices for our purposes.
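The division count above follows from the structure of bisection: every halving of the search interval needs one Sturm count, and every Sturm count divides once per matrix row. The sketch below illustrates this with a textbook Sturm-sequence recurrence; it is not the DSTEBZ source, and the function names, the Gershgorin starting interval and the zero-pivot guard are assumptions made for illustration only.

```python
import math


def sturm_count(diag, off, x):
    """Count eigenvalues of a symmetric tridiagonal matrix that lie below x.

    diag holds the n diagonal entries, off the n-1 off-diagonal entries.
    The loop performs one division per row, i.e. n divisions per call.
    """
    count = 0
    d = 1.0
    for i, a in enumerate(diag):
        b2 = off[i - 1] ** 2 if i > 0 else 0.0
        d = a - x - b2 / d              # the single divide for this row
        if d == 0.0:
            d = -1e-300                 # illustrative guard against a zero pivot
        if d < 0.0:
            count += 1
    return count


def bisect_eigenvalue(diag, off, k, steps=53):
    """Approximate the k-th smallest eigenvalue (k is 0-based) by bisection."""
    n = len(diag)
    # Gershgorin discs give an interval that contains every eigenvalue.
    radius = [
        (abs(off[i - 1]) if i > 0 else 0.0) + (abs(off[i]) if i < n - 1 else 0.0)
        for i in range(n)
    ]
    lo = min(a - r for a, r in zip(diag, radius))
    hi = max(a + r for a, r in zip(diag, radius))
    for _ in range(steps):              # one Sturm count per bit of accuracy
        mid = 0.5 * (lo + hi)
        if sturm_count(diag, off, mid) > k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def modeled_divisions(n, e, bits=53):
    """Division count of the simple model in the text: bits * n per eigenvalue."""
    return bits * n * e
```

For the matrix with 2 on the diagonal and -1 on the off-diagonals, the eigenvalues are known in closed form ($2 - 2\cos(k\pi/(n+1))$), which makes it a convenient check for `bisect_eigenvalue`.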


4.3.2  Inverse  iteration 


Inverse iteration typically requires $3n$ divides and $45n$ flops per eigenvalue plus the cost of re-orthogonalization.

In PDSYEVX the number of flops performed by any particular processor, $p_i$, during re-orthogonalization is:

$$\sum_{C \in \{\text{clusters assigned to } p_i\}} 4 \sum_{i=1}^{\mathrm{size}(C)} n\_iter(i)\; n\, (i-1),$$

where $n\_iter(i)$ is the number of inverse iterations performed for eigenvalue $i$ (typically 3). If the size of the largest cluster is greater than $n/p$, the processor which is responsible for this cluster will not be responsible for any eigenvalues outside of this cluster.

Hence, if the size of the largest cluster is greater than $n/p$, the number of flops performed by the processor to which this cluster is assigned is (on average):

$$4\; n\_iter\; n\; \frac{1}{2} c^2 = 6\, n\, c^2$$


⁹ Although these are not all BLAS Level 1 flops, they have the same ratio of memory operations to flops that is typical of BLAS Level 1 operations.

where $c = \max_{C \in \{\text{clusters}\}} \mathrm{size}(C)$, i.e. the number of eigenvalues in the largest cluster, and $n\_iter = 3$ is the average number of inverse iterations performed for each eigenvalue.
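To see how close the $6 n c^2$ approximation is to the cluster-by-cluster sum, the following sketch evaluates both. It assumes a constant `n_iter` of 3 for every eigenvalue; the function names are hypothetical and chosen only for this illustration.

```python
def reorth_flops_exact(n, cluster_sizes, n_iter=3):
    """Sum 4 * n_iter * n * (i - 1) over every eigenvalue i of every cluster."""
    total = 0
    for size in cluster_sizes:
        for i in range(1, size + 1):
            total += 4 * n_iter * n * (i - 1)
    return total


def reorth_flops_model(n, c, n_iter=3):
    """Leading-order model 4 * n_iter * n * c^2 / 2 (6 n c^2 when n_iter is 3)."""
    return 2 * n_iter * n * c * c
```

For a single cluster of 100 eigenvalues in a matrix of order 1000, the exact sum is 59,400,000 flops and the model gives 60,000,000, a difference of 1%.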

As the problem size and number of processors grow, the largest cluster that PDSYEVX is able to reorthogonalize properly gets smaller (relative to $n$). As a consequence, reorthogonalization will not require large execution time¹⁰. Specifically, if the largest cluster has fewer than $n/p$ eigenvalues (i.e. fits easily on one processor), the number of eigenvalues that will be assigned to any one processor, and hence the total number of flops it must perform, is limited. The worst case is where there are $p+1$ clusters each of size $\frac{n}{p+1}$. In this case, one processor must be assigned 2 clusters of size $\frac{n}{p+1}$, requiring (on average) $2 \times 6 n (\frac{n}{p+1})^2$ or roughly $12 \frac{n^3}{p^2}$ flops.¹¹

Our model for the execution time of Gram Schmidt re-orthogonalization ($\sum_{i=1}^{c} 4 n i\, \gamma_1 = 2 n c^2 \gamma_1$, where $c$ is the size of the largest cluster) assumes that the processor to which the largest cluster is assigned is not assigned any other clusters. This is true if the largest cluster has more than $n/p$ eigenvalues in it. If the largest cluster of eigenvalues contains fewer than $n/p$ eigenvalues, reorthogonalization is relatively unimportant.

Inderjit Dhillon, Beresford Parlett and Vince Fernando's recent work [139, 77] on the tridiagonal eigenproblem substantially reduces the motivation to model the existing ScaLAPACK tridiagonal eigensolution code in great detail, since we expect them to replace the current code with something that costs $O(\frac{n^2}{p})$ flops, $O(\frac{n^2}{p})$ message volume and $O(p)$ messages, which is negligible compared to tridiagonal reduction.

4.3.3  Load  imbalance  in  bisection  and  inverse  iteration 

Load imbalance during the tridiagonal eigendecomposition is caused in part by the fact that not all processes will be assigned the same number of eigenvalues and eigenvectors and in part by the fact that different eigenvalues and eigenvectors will require slightly different amounts of computation. Our experience indicates that the load imbalance corresponds roughly to the cost of finding two eigenvalues ($2 \times (53 n \gamma_{\div} + 53 \times 5 n \gamma_1)$) and two eigenvectors ($2 \times (3 n \gamma_{\div} + 45 n \gamma_1)$) on one processor. Hence, our execution time model for the load imbalance during tridiagonal eigendecomposition is:

$$(2 \times 53 + 2 \times 3)\, n\, \gamma_{\div} + (2 \times 53 \times 5 + 2 \times 45)\, n\, \gamma_1 = 112\, n\, \gamma_{\div} + 620\, n\, \gamma_1 .$$

¹⁰ This is not to suggest that reorthogonalization in PDSYEVX gets better as $n$ and $p$ increase (indeed PDSYEVX may fail to reorthogonalize large clusters for large $n$ and $p$). It just means that reorthogonalization in PDSYEVX will not take a long time for large $n$ and large $p$.

¹¹ The appearance of $p^2$ in the denominator stems from the restriction $c < n/p$, meaning that as $p$ increases the largest cluster size that PDSYEVX can handle efficiently decreases.

In evaluating the cost of load imbalance in tridiagonal eigendecomposition, one must include load imbalance in Gram Schmidt reorthogonalization. Indeed, if the input matrix has one cluster of eigenvalues that is substantially larger than all others (yet small enough to fit on one processor so that PDSYEVX can reorthogonalize it), Gram Schmidt reorthogonalization is very poorly load balanced and could be treated almost entirely as a load imbalance cost.

We do not separate the load imbalance cost of Gram Schmidt from what the execution time for Gram Schmidt would be if the load were balanced, because doing so would complicate the model without making it match actual execution time any better.

4.3.4  Execution  time  model  for  tridiagonal  eigendecomposition  in  PDSYEVX 

The cost of tridiagonal eigendecomposition in PDSYEVX is the sum of the cost of bisection, inverse iteration and reorthogonalization. Hence:

$$53 \frac{n e}{p} \gamma_{\div} + 3 \frac{n m}{p} \gamma_{\div} + 112\, n\, \gamma_{\div} + 265 \frac{n e}{p} \gamma_1 + 45 \frac{n m}{p} \gamma_1 + 620\, n\, \gamma_1 + 2\, n c^2 \gamma_1$$

The load imbalance terms $112\, n\, \gamma_{\div}$ and $620\, n\, \gamma_1$ stem partly from the fact that some processors will typically be assigned at least one more eigenvalue and/or eigenvector than other processors and from the fact that both bisection and inverse iteration are iterative procedures requiring more time on some eigenvalues than on others.
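The model above is a plain sum of terms, so it can be evaluated directly. The sketch below groups the terms by their origin; the function name is hypothetical, and the reorthogonalization coefficient follows the $2 n c^2 \gamma_1$ form used in this section. Times are returned in whatever unit the two machine constants are given in.

```python
def tridiag_eig_time(n, e, m, p, c, gamma_div, gamma_1):
    """Evaluate the tridiagonal eigendecomposition model of Section 4.3.4.

    n: matrix size, e: eigenvalues wanted, m: eigenvectors wanted,
    p: processors, c: size of the largest cluster,
    gamma_div: time per divide, gamma_1: time per BLAS Level 1 flop.
    """
    bisection = 53 * n * e / p * gamma_div + 265 * n * e / p * gamma_1
    inverse_iteration = 3 * n * m / p * gamma_div + 45 * n * m / p * gamma_1
    imbalance = 112 * n * gamma_div + 620 * n * gamma_1
    reorthogonalization = 2 * n * c * c * gamma_1
    return bisection + inverse_iteration + imbalance + reorthogonalization
```

As an illustration only, taking the constants 3.85 and 0.074 microseconds that appear in the Paragon column of Table 5.3 as `gamma_div` and `gamma_1`, with $n = e = m = 3840$, $p = 64$ and $c = 1$, the model evaluates to roughly 57 seconds, almost all of it from the divide terms.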

4.3.5  Redistribution 

Inverse iteration typically leaves the data distributed in a manner in which it would be awkward and inefficient to perform back transformation. If each eigenvector is computed entirely within one processor, as PDSTEIN does, inverse iteration requires no communication, provided that all processors have a copy of the tridiagonal matrix and the eigenvalues. This, however, leaves the eigenvector matrix distributed in a one-dimensional manner in which back transformation would be inefficient. Furthermore, since different processors may have been assigned to compute a different number of eigenvectors (to improve orthogonality among the eigenvectors), the eigenvector matrix will typically not be distributed in a block cyclic manner. Since PDORMTR (and all ScaLAPACK matrix transformations) requires that the data be in a 2D block cyclic distribution, the eigenvectors must, at least, be redistributed to a block cyclic distribution. For convenience and potential efficiency¹², PDSTEIN redistributes the eigenvector matrix.

The simplest method of data redistribution is to have each processor send one message to each of the other processors. That message contains the data owned by the sender and needed by the receiver. Redistributing the data in this manner requires that each processor send every element that it owns to other processors¹³ and receive what it needs from other processors. Since each processor owns¹⁴ roughly $(nm)/p$ elements and needs roughly $(nm)/p$ elements, the total data sent and received by each processor is roughly $2(nm)/p$. In our experience, data redistribution is slightly less efficient than other broadcasts and reductions and hence we use $4 (nm)/p\; \beta$ as our model for the data redistribution cost.

4.4  Back  Transformation 

Transforming the eigenvectors of the tridiagonal matrix back to the eigenvectors of the original matrix requires multiplying by a series of Householder vectors. The Householder updates can be applied in a blocked manner with each update taking the form $(I + V T V^T)$, where $V \in R^{n' \times nb}$ is the matrix of Householder vectors, and $T$ is an $nb \times nb$ triangular matrix [27].

The following steps compute $Z' = (I + V T V^T) Z$. These are performed for each block Householder update. The major contributors to the cost are noted below.

Compute  T 

Computing  the  nb  by  nb  triangular  matrix  T  requires  nb  calls  to  DGEMV,  a  summation 
of  nb2/2  elements  within  the  current  processor  column  and  nb  calls  to  DTRMV.  The 
computation  of  T  need  not  be  in  the  critical  path.  There  are  n/ nb  different  matrices 
T  that  need  to  be  computed,  and  they  could  be  computed  in  advance  in  parallel. 

Compute $W = V^T Z$

Spread $V$ across. Compute $V^T Z$ locally. Sum $W$ within each processor column.

The spread across of $V$ is performed on a ring topology because the processor columns need not be synchronized. Each processor column must receive $V$ and send $V$, hence the cost for each processor column is: $(2 n'\, nb)/p_r$

The local computation of $V^T Z$ is a call to DGEMM involving $2 (m/p_c + vnb)(n'/p_r + nb/2)\, nb$ flops. Ignoring the lower order $vnb\; nb^2$ term, this is:

$$2 \frac{n'\, m\, nb}{p} + 2 \frac{n'\, vnb\; nb}{p_r} + \frac{m\, nb^2}{p_c} .$$

¹² The actual efficiency depends upon the data distribution chosen by the user for the input and output matrices.

¹³ Although some data will not have to be sent because it is owned and needed by the same processor, this will typically be a minor savings.

¹⁴ In the absence of large clusters of eigenvalues assigned to a single processor.


Compute  W  =  TW 
Local. 

Compute  Z  =  Z  —  VW 

Spread  W  down.  Local  computation.  (Note:  V  has  already  been  spread  across.) 

The local computation of $Z - VW$, like the computation of $V^T Z$, involves a call to DGEMM involving $2 (m/p_c + vnb)(n'/p_r + nb/2)\, nb$ flops.
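The DGEMM flop count used in both the $V^T Z$ and $Z - VW$ steps can be multiplied out to see which terms scale with $1/p$, $1/p_r$ and $1/p_c$. The sketch below, with hypothetical function names, compares the product form with its expansion; the two differ only by the lower order $vnb\; nb^2$ term.

```python
def local_gemm_flops(m, n_prime, nb, pr, pc, vnb):
    """Product form: 2 (m/pc + vnb) (n'/pr + nb/2) nb."""
    return 2 * (m / pc + vnb) * (n_prime / pr + nb / 2) * nb


def local_gemm_flops_expanded(m, n_prime, nb, pr, pc, vnb):
    """Expanded form with the vnb * nb^2 term dropped."""
    p = pr * pc
    return (2 * n_prime * m * nb / p        # perfectly parallel part
            + 2 * n_prime * vnb * nb / pr   # imbalance in the eigenvector layout
            + m * nb * nb / pc)             # half-block edge effect
```

With $m = 3840$, $n' = 1024$, $nb = 32$, $p_r = p_c = 8$ and $vnb = 4$, the product form gives 4,460,544 flops and the expansion 4,456,448; the gap of 4096 is exactly $vnb\; nb^2$.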

Back transformation differs from reduction to tridiagonal form in many ways. It requires many fewer messages: $O(n/nb)$ versus $O(n)$. Because the back transformation of each eigenvector is independent, the Householder updates can be applied in a pipelined manner, allowing $V$ to be broadcast in a ring instead of a tree topology. PDLARFB does not use the PBLAS, allowing $V$ to be broadcast once but used twice. Since the number of eigenvectors does not change during the update, half of the load imbalance depends on $\mathrm{mod}(n, nb\, p_c)$ and can be reduced significantly if $\mathrm{mod}(n, nb\, p_c) = 0$. In the following table $vnb$ is the imbalance in the 2D block-cyclic distribution of eigenvectors¹⁵.

The cost of back transformation, shown in Table 4.10, is asymmetric: the $O(n^2/p_r)$ cost is smaller than the $O(n^2/p_c)$ cost. Furthermore, the $O(n^2/p_r)$ cost can be reduced further by computing $T$ in parallel and choosing a data layout which will minimize $vnb$. Reducing the $O(n^2/p_r)$ cost would allow $p_r < p_c$, reducing the $O(n^2/p_c)$ costs. This is discussed further in Chapter 8.


¹⁵ vnb is computed as follows: extravecsonproc1 - extravecs/pr, where extravecs = mod(n, nb pc) and extravecsonproc1 = min(nb, extravecs).
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Table  4.7:  Computation  cost  in  PDSYEVX 


scale  factor 

update  current  column 
(Table  4.1) 

compute  reflector 
(Table  4.2) 

matrix  vector  product 
(Table  4.4) 

update  vector  product 

(Table  4.5) 

compute  update  vector 

(Table  4.6) 

perform  rank  2k  update 

(Table  4.10) 

tri diagonal  eigendecomposition 

(Section  4.3) 

back  transformation 

(Table  4.10) 

total 

n3 

p  73 

2 

3 

2 

3 

n2m  .. 

—  73 

2 

2 

73  |3W 

to 

2 

3 

2 

3 

n2  nb  pbf 

Pr  72 

l 

l 

n2  nb 

Pr  T'2 

l 

1 

2 

2 

1 

2 

4 

n2  nb 

Pc  72 

1 

2 

1 

2 

n2  nb  pbf 

Pc  73 

l 

l 

n2  nb 

Pc  73 

1 

2 

1 

2 

n2  vnb 

Pc  73 

2 

2 

n2  nb 

Pc  73 

1 

2 

1 

2 

n  m  nb 

Pc  73 

3 

3 

n  , 

3 

3 

n2 

nb2  pc  pbf  3 

2 

2 

n  6  2 

2 

4 

2 

8 

n2  r 

pr  pbf  nb  °2 

2 

2 

n 2  , 

1 

1 

n6  4 

2 

l 

1 

4 

l 

9 
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Table  4.8:  Computation  cost  (tridiagonal  eigendecomposition)  in  PDSYEVX 


scale  factor 

update  current  column 
(Table  4.1) 

compute  reflector 
(Table  4.2) 

matrix  vector  product 
(Table  4.4) 

update  vector  product 
(Table  4.5) 

compute  update  vector 
(Table  4.6) 

perform  rank  2k  update 
(Table  4.10) 

tri diagonal  eigendecomposition 

(Section  4.3) 

back  transformation 

(Table  4.10) 

total 

ir7^ 

53 

53 

3 

3 

nj-L. 

112 

112 

if  7i 

265 

265 

if  7i 

45 

45 

nji 

620 

620 

nc2  71 

6 

6 
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Table  4.9:  Communication  cost  in  PDSYEVX 


scale  factor 

update  current  column 
(Table  4.1) 

compute  reflector 
(Table  4.2) 

matrix  vector  product 
(Table  4.4) 

update  vector  product 
(Table  4.5) 

compute  update  vector 

(Table  4.6) 

perform  rank  2k  update 

(Table  4.10) 

tri diagonal  eigendecomposition 

(Section  4.3) 

back  transformation 

(Table  4.10) 

total 

n  [log2(pe)]a- 

2 

2 

4 

n  |"log2 (iv)]  a 

2 

3 

2 

4 

2 

13 

n  1cm  (pr,pc)/pra 

1 

1 

n  1cm  (pr,pc)/pca 

1 

1 

n  f logo  ( lcm(pr  ,pc  ) )]  a 

1 

1 

’£  flog2(Pc)l/? 

1 

1 

i 

3 

[log2(Pr)l/^ 

1 

i 

2 

^(3 

Pr  ' 

1 

2 

1 

2 

^(3 

Pc  1 

l 

i 

2 

n  nb  [log2(pe)]/t 

l 

1 

-i 

1 

11  nb  [log2(p,.)]/t 

1 

l 

2 

-i 

3 
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Table  4.10:  The  cost  of  back  transformation  (PDORMTR) 


Task 

Filedine  number 

or  subroutine 

Execution  time  con¬ 
tribution  from 
columns  j  =  1  to  n 
shown  explicitly 

Execution  time 
( simplified ) 

Compute  T 

pdsyevx.f :  8  5  5 
pdormtr.fAOS 
pdormqr.f:  394 
pdlarft.f : 

n 

E  (  ^Og2{pr)]^l3  + 

n'= l,nb 

2nb52+24^72+) 

2^2  +  0.5^72 

Pr 

Compute  W  =  VT Z 

pdsyevx.f :  8  5  5 
pdormtr.fAOS 
pdormqr.f  A12 
pdlarfb.f :  3  2  2 , 

398,405 

n 

E  (2  ^ 

n'= l,nb 

+  flo&(Pr)l^/?  +  fi3 

+  2f^473  +  2^4^^ 

i  o  n  n'  nb  _ , 

+  2  p  73  ) 

—4+ 

Pr  ' 

^flog2(Pr)l/?  + 

+  C  +  ^73  + 

2  2 
n  '  vnb  i  n  '  m 

pr  +3+  p  73 

Compute  W  =  TW 

pdsyevx.f :  8  5  5 
pdormtr.fAOS 
pdormqr.f  A12 
pdlarfb.f  A12 

n 

E  (  C3  +  2^4x  ) 

n'= l,nb 

+  C  +  ^73 

Compute  Z  =  Z  —  VW 

pdsyevx.f :  8  5  5 
pdormtr.fAOS 
pdormqr.f  A12 
pdlarfb.f  :41 5 ,425 

n 

E  (rioS2(pdi^/3 

n'= l,nb 

,  c  t  o  m  nb^  _  t 

+  *3+2  2pc  73  + 

2!n^nb73  +  22i^nb73) 

^flog2(Pr)l/?  + 

+C  +  ^x  + 

2  2 
n  '  vnb  I  n  '  m  _ 

pr  +3+  p  73 

Total 

si/3+2^riog2(pr)l/3+2n52+0.5^72+3i53+3’i^nb73 

+  2£_inb73+2£_™73 

Standard  data  layout 

2ltf  n°S2(V?)l  P+2n  52+0.5^+3^53+3^ 

+  2^X3+24^73 



Chapter  5 

Execution  time  of  the  ScaLAPACK 
symmetric  eigensolver,  PDSYEVX  on 
efficient  data  layouts  on  the 
Paragon 


The detailed execution time model gives us confidence that we understand the execution time of PDSYEVX. It explains performance on a wide range of problem sizes, data layouts, input matrices, computers and user requests. However, the same complexity that allows the detailed model to explain performance over such a large domain makes it difficult to grasp, understand and interpret. The simple six term model shown in this chapter is designed to explain the performance of the common, efficient case on a well known computer.

PDSYEVX takes 205 seconds to compute the eigendecomposition of a 3840 by 3840 symmetric random matrix on a 64 node Paragon in double precision. Counting only the $\frac{10}{3} n^3$ flops, PDSYEVX achieves 920 Megaflops per second, which equals 14 Megaflops per second per node.
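A quick arithmetic check of the per-node rate, assuming the $\frac{10}{3} n^3$ flop count of the matrix transformations is the only work counted. The function name is hypothetical.

```python
def achieved_flop_rate(n, seconds, nodes):
    """Return (aggregate flops/s, flops/s per node) counting only 10/3 n^3 flops."""
    flops = 10.0 / 3.0 * n ** 3
    total_rate = flops / seconds
    return total_rate, total_rate / nodes
```

Calling `achieved_flop_rate(3840, 205, 64)` gives an aggregate rate of about $9.2 \times 10^8$ flops per second and about $1.4 \times 10^7$ flops per second per node.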

For large, well behaved¹ matrices, PDSYEVX is efficient, as detailed in Table 5.1. For well behaved 3840 × 3840 matrices, PDSYEVX spends 63% = (28+35)% of its time on necessary computation and only 35% of its time on communication, load imbalance and overhead required for execution in parallel.

¹ For PDSYEVX's purpose, a well behaved matrix is one which does not have any large clusters of eigenvalues whose associated eigenvectors must be computed orthogonally.

Table 5.1: Six term model for PDSYEVX on the Paragon

Component | Model | % time (n = 3840, p = 64)
matrix transformation computation (See section 5.3) | $\frac{10}{3} \frac{n^3}{p} \gamma$ ($\gamma = .0215$) | 35
tridiagonal eigendecomposition computation (See section 5.4) | $239 \frac{n^2}{p}$ | 28
message initiation (See section 5.5) | $17\, n \log_2(\sqrt{p})\, \alpha$ ($\alpha = 65.9$) | 10
message transmission (See section 5.6) | $7 \frac{n^2}{\sqrt{p}} \log_2(\sqrt{p})\, \beta$ ($\beta = .146$) | 4
order n overhead & imbalance (See section 5.7) | $2780\, n$ | 7
order n² overhead & imbalance (See section 5.8) | $14.0 \frac{n^2}{\sqrt{p}}$ | 14

n  Matrix size
p  Number of processors
γ  Matrix-matrix multiply time (= .0215 microseconds/flop)
α  Message latency time (= 65.9 microseconds/message)
β  Message throughput time (= .146 microseconds/word)

Although PDSYEVX is efficient on the PARAGON², Table 5.1 shows us that there is room for improvement. Ignoring the execution time required for solution of the tridiagonal eigenproblem for the moment, we note that the matrix transformations reach only about 50% of peak performance (35% vs. 35+10+4+7+14=72%) for this problem size (roughly the largest that will fit on this PARAGON). Furthermore, efficiency will be lower for smaller problem sizes.

Unfortunately, there is no single culprit that accounts for the inefficiency. Communication accounts for a bit less than half of the inefficiency, while software overhead accounts for a bit more than half of the inefficiency.

² Details about the hardware and software used for this timing run are given in Table 6.3.

One could argue that while n=3840 on 64 nodes is the largest problem that PDSYEVX can run on this particular computer, it is still a relatively small problem. However, there are several reasons not to ignore this result. First, while it is true that newer machines have more memory, they also have much faster floating point units and steeper memory hierarchies, and few offer communication to computation ratios as high as the PARAGON. Furthermore, we should strive to achieve high efficiency across a range of problem sizes, not just for the largest problems that can fit on the computer. Achieving high efficiency on small problem sizes means that users can efficiently use more processors and hence reduce execution time.

In  summary,  PDSYEVX  is  a  good  starting  point,  but  leaves  room  for  improvement. 
However,  significantly  improving  performance  will  require  attacking  more  than  one  source 
of  inefficiency. 

The fact that PDSYEVX spends 28% of its total time in solving the tridiagonal eigenproblem is a result of the slow divide on the PARAGON. The PARAGON offers two divides: a fast divide and a slow divide that meets the IEEE 754 spec [7]. Although ScaLAPACK's bisection and inverse iteration codes are designed to work with an inaccurate divide, ScaLAPACK uses the slow correct divide by default.

5.1  Deriving  the  PDSYEVX  execution  time  on  the  Intel  Paragon 
(common  case) 

This  six  term  model  is  based  on  the  detailed  model  described  in  section  4  which 
has  been  validated  on  a  number  of  distributed  memory  computers  and  a  wide  range  of  data 
layouts  and  problem  sizes. 

5.2  Simplifying assumptions allow the full model to be expressed as a six term model

I assume that a reasonably efficient data layout is chosen. I set the data layout parameters as follows:

nb = 32. The optimal block size on the Paragon is about 10; however, the reduction in execution time obtained by using nb = 10 rather than nb = 32 is less than 10%, so we stick to our standard suggested value of nb.

pr = pc = $\sqrt{p}$. PDSYEVX achieves the best performance³ when $p_c < p_r <$ [...]. Assuming that pr = pc = $\sqrt{p}$ allows the pr and pc terms to be coalesced into a single $\sqrt{p}$ term.

pbf = 2. The panel blocking factor⁴, pbf = max(2, lcm(pr,pc)/pc) in ScaLAPACK version 1.5.

vnb = 0. vnb is the imbalance in the number of rows in the original matrix as distributed amongst the processors. I assume that the matrix is initially balanced perfectly amongst all processors, i.e. n is a multiple of pr nb.

$\gamma_2 = \gamma_3$. We assume for the simplified model that all flops are performed at the peak flop rate. This introduces an error equal to $\frac{2}{3} \frac{n^3}{p} (\gamma_2 - \gamma_3)$, which is typically no more than 2-5% of the total time on the PARAGON.

m = e = n. Assume that a full eigendecomposition is required, i.e. all eigenvalues are required (e = n) and all eigenvectors are required (m = n).

c = 1. Assume that the input matrix has no clusters of eigenvalues.

In addition, we set all of the machine parameters to constants measured or estimated on the Intel Paragon as shown in Table 6.3 in order to coalesce the overhead, load imbalance, and tridiagonal eigendecomposition terms into just three terms.
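The panel blocking factor rule quoted above, pbf = max(2, lcm(pr,pc)/pc), is easy to evaluate for candidate processor grids. The sketch below is only an illustration of that formula; the function name is hypothetical and is not a ScaLAPACK routine.

```python
from math import gcd


def panel_blocking_factor(pr, pc):
    """Evaluate pbf = max(2, lcm(pr, pc) / pc) for a pr by pc processor grid."""
    lcm = pr * pc // gcd(pr, pc)
    return max(2, lcm // pc)     # lcm is always a multiple of pc
```

For any square grid the quotient lcm(pr,pc)/pc is 1, so the rule returns 2, which is the value assumed in the list above. Grids whose dimensions share few factors give larger values, for example `panel_blocking_factor(16, 4)` is 4.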


5.3  Deriving the computation time during matrix transformations in PDSYEVX on the Intel Paragon

Table 5.2 shows that PDSYTRD performs $\frac{4}{3} \frac{n^3}{p} + O(n^2)$ flops per process. Of these, $\frac{2}{3} \frac{n^3}{p} + O(n^2)$ are matrix vector multiply flops and $\frac{2}{3} \frac{n^3}{p} + O(n^2)$ are matrix matrix multiply flops. PDSYTRD performs the same floating point operations that the LAPACK routine, DSYTRD, does, and $\frac{4}{3} n^3$ is the textbook [84] number of flops for reduction to tridiagonal form.

³ Performance of PDSYEVX is not overly sensitive to the data layout, provided that nb is sufficiently large to allow good DGEMM performance, that the processor grid is reasonably close to square and that lcm(pr,pc) is not outrageous compared to pc and pr. (The latter factor is only relevant when one is dealing with thousands of processors.) I have not performed a detailed study of when using fewer processors results in lower execution time. However, if you drop processors only when necessary to make $p_c < p_r <$ [...] and lcm(pr,pc) < 10 pc, the processor grid chosen will allow performance within 10% of the optimal processor grid.

⁴ The matrix vector multiplies are each performed in panels of size pbf nb. See Section 4.2.2.

Table 5.2: Computation time in PDSYEVX

Task | Full model | Six term model
computation time during reduction to tridiagonal form (See section 4.2) | $\frac{2}{3} \frac{n^3}{p} \gamma_2 + \frac{2}{3} \frac{n^3}{p} \gamma_3$ | $\frac{4}{3} \frac{n^3}{p} \gamma_3$
computation time during back transformation (See table 4.10) | $2 \frac{n^2 m}{p} \gamma_3$ | $2 \frac{n^3}{p} \gamma_3$
Total | | $\frac{10}{3} \frac{n^3}{p} \gamma_3$

Table  5.3:  Execution  time  during  tridiagonal  eigendecomposition 


Task 

Full  model 

Paragon  model 

Paragon  time 

computation  time  during 
tridiagonal  eigendecomposition 
(See  section  4.3) 

r  ^  S  _  1  A  r'  ^  ~  1 

265  ~  Vl+4-5  —  71  + 

53^  7^+3^221  7^_|_ 

P  *  •  '  p  1  •  1 

2  n  c 2  71 

(3r0x.074 

+  56x3.85+0)2^ 

239.  — 

p 

Total 

239.  — 

p 

PDORMTR performs $2\frac{n^3}{p} + O(n^2)$ flops per process. Again this is the same as the
LAPACK routine.

5.4 Deriving the computation time during eigendecomposition
of the tridiagonal matrix in PDSYEVX on the Intel
Paragon

The computation time during tridiagonal eigendecomposition, in the absence of
clusters of eigenvalues, is $O(n^2)$ and hence becomes less important for large n.

The simplified model for the execution time of the tridiagonal eigensolution on the
PARAGON in table 5.3 is obtained from the detailed model by replacing $\gamma_1$ and $\gamma_\div$ with their
values on the PARAGON and by assuming that all clusters of eigenvalues are of modest size.

Load imbalance during the tridiagonal eigendecomposition is caused in part by the
fact that not all processes will be assigned the same number of eigenvalues and eigenvectors
and in part by the fact that different eigenvalues and eigenvectors will require slightly different
amounts of computation. Our experience indicates that the load imbalance corresponds
roughly to the cost of finding two eigenvalues and two eigenvectors.

Table 5.4: Message initiations in PDSYEVX

Task | Full model | Six term model
message initiation during reduction to tridiagonal form (See table 4.9) | $(13\lceil\log_2(p_r)\rceil + 4\lceil\log_2(p_c)\rceil)\, n\, \alpha$ | $17\, n \log_2(\sqrt{p})\, \alpha$
Total | | $17\, n \log_2(\sqrt{p})\, \alpha$

Table  5.5:  Message  transmission  in  PDSYEVX 


Task 

Full  model 

Six  term  model 

message  transmission  time 
during  reduction  to 
tridiagonal  form  ( See 

table  4.9) 

mog2(pc)]£  +  2\log2(pr)]£)l3 

5  lo§2  (vT)  4 

message  transmission  time 
during  back  transformation 
(See  table  4.10) 

2\log2(pr)V^l3 

2^  log 2(77)4 

Total 

7^  log 2(Tp)4 

5.5  Deriving  the  message  initiation  time  in  PDSYEVX  on  the 
Intel  Paragon 

Table 5.4 shows that PDSYEVX requires $17\, n \log_2(\sqrt{p})$ message initiations.

5.6  Deriving  the  inverse  bandwidth  time  in  PDSYEVX  on  the 
Intel  Paragon 

Table 5.5 shows that PDSYEVX transmits $7\,\frac{n^2}{\sqrt{p}} \log_2(\sqrt{p})$ words per node.

5.7  Deriving  the  PDSYEVX  order  n  imbalance  and  overhead 
term  on  the  Intel  Paragon 


Table 5.6 shows the origin of the $O(n)$ load imbalance cost on the Intel Paragon.

Table 5.6: $O(n)$ load imbalance cost on the PARAGON

Task | Full model | Paragon model | Paragon time
load imbalance during eigendecomposition (See section 4.3) | $620\gamma_1 + 112\gamma_\div$ | $620 \times 0.0740 + 112 \times 3.85$ | $477\, n$
order n overhead term in reduction to tridiagonal form (See table 4.7) | $9\delta_1 + 6\delta_2$ | $9 \times 235 + 6 \times 23.5$ | $2256\, n$
order n overhead term in back transformation (See table 4.10) | $2\delta_2$ | $2 \times 23.5$ | $47\, n$
Total | | | $2780\, n$

Table 5.7: Order $n^2/\sqrt{p}$ load imbalance and overhead term on the PARAGON


Task 

Full  model 

Paragon  model 

Paragon  time 

Order  n2  /  y/p  overhead  term 
in  reduction  to  tridiagonal  form 
(See  table  4.7) 

2  nb  pbf  pc 

+2  2  hf  ^3+p  £1 

nbz  pbf  pc  Pc 

f  2x23.5  |  2x103 

V  32x2  1  32x32  x2 

+3.97)^ 

4.70  + 

Vp 

Order  n2  /  yfp  load  imbalance 
term  in  reduction  to  tridiagonal 
form  ( See  table  4.7) 

7  nb^  ,  1  n2  nb  ^ 

2—32+2—32 

1  n2  nb  pbf  ,  n2  nb„  , 

+  p/  ^2+  pr  73  + 

1  n2  nb  .  1  1  n2  nb  pbf  _ 

2  pc  X3+2  Vr  73 

(|x32  +  |x32 
+32  X  2)  X0.0247 
+  (§x32  +  |x32 
+  2  X32)  X0.024-5 

6.81  + 

Vp 

Order  n2  /  y/p  load  imbalance 
term  in  back  transformation 
(See  table  4.10) 

n  r  n2  nb  „  t  q  n  m  nb  „ 

U.  5 - "yo  +4> - 73 

Pr  1  Pc  M 

1  0  n2  vnb  _ , 

+  2  Pc  73 

(0.5x32x0.0247 

+3x32x0.0245 

+  2x0.0215x0)441 

2.46  + 

Vp 

Total 

14.0  + 

Vp 

2 

5.8 Deriving the PDSYEVX order $n^2/\sqrt{p}$ imbalance and overhead
term on the Intel Paragon

The order $n^2/\sqrt{p}$ load imbalance and overhead term on the Intel Paragon, $14.0\, n^2/\sqrt{p}$, is
shown in table 5.7.

See  section  5.2  for  details  on  the  assumptions  made  to  simplify  the  full  model  to 
the  six  term  model.  Note  that  vnb  is  assumed  to  be  zero  and  that  pbf  is  assumed  to  be  2. 
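
The six terms derived in sections 5.3 through 5.8 can be collected into one small function. The sketch below is an illustration only: the name `six_term_time` is hypothetical, the coefficients 239, 2780 and 14.0 are the Paragon-specific constants read from tables 5.3, 5.6 and 5.7, the bandwidth coefficient 7 is the value stated in section 5.6, and the time unit (assumed here to be microseconds) must be the same for alpha, beta and gamma3.

```python
from math import log2, sqrt


def six_term_time(n, p, alpha, beta, gamma3):
    """Evaluate each term of the simplified (six term) Paragon model.

    n: matrix size, p: number of processors (pr = pc = sqrt(p) assumed),
    alpha: message latency, beta: transmission cost per word,
    gamma3: matrix-matrix multiply time per flop.
    All three machine parameters are assumed to share one time unit.
    """
    sp = sqrt(p)
    lg = log2(sp)
    terms = {
        "flops": (10.0 / 3.0) * n**3 / p * gamma3,   # table 5.2
        "tridiagonal_eig": 239.0 * n**2 / p,         # table 5.3
        "latency": 17.0 * n * lg * alpha,            # table 5.4
        "bandwidth": 7.0 * n**2 / sp * lg * beta,    # section 5.6
        "order_n": 2780.0 * n,                       # table 5.6
        "order_n2_sqrtp": 14.0 * n**2 / sp,          # table 5.7
    }
    terms["total"] = sum(terms.values())  # summed before "total" is stored
    return terms


# Illustrative call; the machine parameters here are placeholders.
t = six_term_time(n=1000, p=16, alpha=66.0, beta=0.14, gamma3=0.0235)
for name, value in t.items():
    print(f"{name:16s} {value:14.1f}")
```

Printing the individual terms makes it easy to see which one dominates for a given n and p, which is the main use of the simplified model.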
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Chapter  6 

Performance on distributed memory
computers

6.1 Performance requirements of distributed memory computers
for running PDSYEVX efficiently

The  most  important  feature  of  a  parallel  computer  is  its  peak  flop  rate.  Indeed, 
everything  else  is  measured  against  the  peak  flop  rate.  The  second  most  important  feature 
is  main  memory,  but  which  feature  of  main  memory  is  most  important  depends  on  whether 
you  want  peak  efficiency  (i.e.  using  as  few  processors  as  possible)  or  minimum  execution 
time  (i.e.  using  more  processors).  If  you  plan  to  use  only  as  many  processors  as  necessary, 
filling  each  processor’s  memory  completely,  then  main  memory  size  is  the  most  important 
factor  controlling  efficiency.  If  you  plan  to  use  more  processors,  main  memory  random 
access  time  becomes  the  most  important  factor. 

Network performance of today’s distributed memory computers is good enough to
keep communication cost from being the limiting factor on performance. Furthermore, if
the network performance (either latency or bandwidth) were the limiting factor, there are
ways that we could reduce the communication cost by as much as $\log(\sqrt{p})$[107]. Still, if one
has a network of workstations connected by a single ethernet or FDDI ring, the very low
bisection bandwidth will always keep efficiency low. See section 8.4.2 for details.

6.1.1 Bandwidth rule of thumb

Bandwidth rule of thumb: Bisection bandwidth per processor^1 times the square root
of memory size per processor should exceed floating point performance per processor.

$$\frac{\text{Megabytes/sec}}{\text{processor}} \times \sqrt{\frac{\text{Megabytes}}{\text{processor}}} > \frac{\text{Megaflops/sec}}{\text{processor}}$$

assures that bandwidth will not limit performance.


The bandwidth rule of thumb shows that if memory size grows as fast as peak
floating point execution rate, the network bisection bandwidth need only grow as the square
root of the peak floating point execution rate. This is very encouraging for the future of
parallel computing. This rule also shows that the bandwidth requirement grows as the
problem size decreases. This rule does not make as wide a claim as the memory rule of
thumb: it does not promise that PDSYEVX will be efficient, only that bandwidth will not be
the limiting factor.

Provided the bandwidth rule of thumb holds, execution time attributable to message
volume will not exceed 40% of the time devoted to floating point execution in PDSYEVX
on problems that nearly fill memory.
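
The 40% bound can be checked numerically from the ratio derived in figure 6.1. The sketch below is illustrative: the function name `message_volume_ratio` is hypothetical, and it assumes (as the figure does) that $\lceil\log_2(\sqrt{p})\rceil = 3$, that words are 8 bytes, and that $n^2/p = M \cdot 10^6/(6 \times 8)$.

```python
from math import sqrt


def message_volume_ratio(mfs, mem_mb, mbs, log_term=3):
    """Ratio of message transmission time to floating point time.

    mfs: peak floating point rate per processor (Megaflops/sec)
    mem_mb: main memory per processor (Megabytes)
    mbs: bisection bandwidth per processor (Megabytes/sec)
    log_term: assumed value of ceil(log2(sqrt(p)))
    """
    coeff = 7.5 * log_term * 8 * sqrt(6 * 8) / ((10.0 / 3.0) * 1e3)
    return coeff * mfs / (sqrt(mem_mb) * mbs)


# When the rule of thumb holds with equality, mfs == sqrt(mem_mb) * mbs,
# so the ratio reduces to the bare coefficient, which is below 0.40.
print(round(message_volume_ratio(mfs=100.0, mem_mb=100.0, mbs=10.0), 3))
```

If bisection bandwidth falls below the rule of thumb, the ratio grows in proportion, which is the sense in which bandwidth becomes the limiting factor.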


6.1.2  Memory  size  rule  of  thumb 

Memory  size  rule  of  thumb:  memory  size  should  match  floating  point  performance 

Megabytes  Megaflops/sec 

processor  processor 

assures  that  PDSYEVX  will  be  efficient  on  large  problems. 

This rule is sufficient because it holds even if message latency and software overhead
hold constant as peak performance increases and network bisection bandwidth and
BLAS2 performance increase as slowly as the square root of the increase in the peak flop rate.

1 Bisection bandwidth per processor is the total bisection bandwidth of the network divided by the number
of processors.

$$
\begin{aligned}
\frac{\text{message transmission time}}{\text{floating point execution time}}
&= \frac{7.5\,\frac{n^2}{\sqrt{p}}\,\lceil\log_2(\sqrt{p})\rceil\,\beta}{\frac{10}{3}\,\frac{n^3}{p}\,\gamma_3} && \text{Table 5.1} \\
&= \frac{7.5\,\lceil\log_2(\sqrt{p})\rceil\,\beta}{\frac{10}{3}\,\frac{n}{\sqrt{p}}\,\gamma_3} && \text{Cancel } n^2/\sqrt{p} \\
&= \frac{7.5\,\lceil\log_2(\sqrt{p})\rceil\,\beta}{\frac{10}{3}\,\sqrt{M\,10^6/(6 \times 8)}\;\gamma_3} && n^2/p = M\,10^6/(6 \times 8) \text{ words} \\
&= \frac{7.5 \times 3 \times 8 \cdot 10^{-6}/mbs}{\frac{10}{3}\,\sqrt{M\,10^6/(6 \times 8)}\;10^{-6}/mfs} \\
&= \frac{7.5 \times 3 \times 8\,\sqrt{6 \times 8}}{\frac{10}{3}\,10^3}\;\frac{mfs}{\sqrt{M}\;mbs} && \text{Simplify} \\
&= 0.374\,\frac{mfs}{\sqrt{M}\;mbs}
\end{aligned}
$$

Figure 6.1: Relative cost of message volume as a function of the ratio between peak floating
point execution rate in Megaflops, mfs, and the product of main memory size in Megabytes,
M, and network bisection bandwidth in Megabytes/sec, mbs.


Message latency and software overhead are limited by main memory access time, which
decreases slowly, but bisection bandwidth and BLAS2 performance (which is limited by main
memory bandwidth) continue to improve, though not as rapidly as peak performance.

When the number of megabytes of main memory equals the peak floating point
rate (in megaflops/sec), message latency will typically account for ten times less execution
time than the time devoted to floating point execution in PDSYEVX on problems that nearly
fill memory. The arithmetic in figure 6.2 justifies this statement provided that message
latency does not exceed 100 microseconds.

The memory rule of thumb is too simple to capture all aspects of any computer;
nonetheless we have found it to be useful. The derivation in figure 6.2 makes two main
assumptions: latency is around 100 microseconds and $\lceil\log_2(\sqrt{p})\rceil = 3$. Seldom will either
be exactly correct, but in our experience neither will tend to be small by more than a factor
of 2 (i.e. $p \leq 4096$). The memory rule of thumb also depends on sufficient bandwidth and
on reasonable BLAS2 and software overhead costs. As we will show next, network bandwidth
capacity and BLAS2 performance need not grow rapidly to support this rule and software
overhead costs need only remain constant.
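
The latency arithmetic behind the memory rule of thumb can be reproduced directly. This is a minimal sketch under the two assumptions named above (latency of 100 microseconds and $\lceil\log_2(\sqrt{p})\rceil = 3$) plus the figure's assumption that $n^2/p = M \cdot 10^6/48$; the function name `message_latency_ratio` is hypothetical.

```python
def message_latency_ratio(mfs, mem_mb, latency_us=100.0, log_term=3):
    """Ratio of message latency time to floating point execution time.

    mfs: peak floating point rate per processor (Megaflops/sec)
    mem_mb: main memory per processor (Megabytes)
    latency_us: message latency in microseconds (assumed 100)
    log_term: assumed value of ceil(log2(sqrt(p)))
    """
    alpha = latency_us * 1e-6          # seconds per message
    n2_over_p = mem_mb * 1e6 / 48.0    # n^2/p when the problem fills memory
    gamma3 = 1e-6 / mfs                # seconds per flop at peak rate
    return 17.0 * log_term * alpha / ((10.0 / 3.0) * n2_over_p * gamma3)


# With megabytes of memory equal to megaflops/sec the ratio is the
# constant in front of mfs/M.
print(round(message_latency_ratio(mfs=50.0, mem_mb=50.0), 3))
```

Doubling either the latency or the logarithmic factor doubles the ratio, which is why the text allows each assumption to be off by about a factor of 2.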

$$
\begin{aligned}
\frac{\text{message latency time}}{\text{floating point execution time}}
&= \frac{17\, n\, \lceil\log_2(\sqrt{p})\rceil\, \alpha}{\frac{10}{3}\,\frac{n^3}{p}\,\gamma_3} && \text{Table 5.1} \\
&= \frac{17\, \lceil\log_2(\sqrt{p})\rceil\, \alpha}{\frac{10}{3}\,\frac{n^2}{p}\,\gamma_3} && \text{Cancel } n \\
&= \frac{17\, \lceil\log_2(\sqrt{p})\rceil\, \alpha}{\frac{10}{3} \times (M\,10^6/48) \times \gamma_3} && n^2/p = M\,10^6/48 \text{ words} \\
&= \frac{17 \times 3 \times 100 \cdot 10^{-6}}{\frac{10}{3} \times (M\,10^6/48) \times (10^{-6}/mfs)} \\
&= 0.073\,\frac{mfs}{M} && \frac{17 \times 3 \times 100 \times 10^{-6}}{(10/3)/48} = 0.073
\end{aligned}
$$

Figure 6.2: Relative cost of message latency as a function of the ratio between peak floating
point execution rate in Megaflops, mfs, and main memory size in Megabytes, M.

The memory rule of thumb holds for all computers marketed as distributed memory
computers, but does not hold for non-scalable or extremely low bandwidth networks. One
could design a distributed memory computer for which this rule does not hold, but the
features that are necessary for this rule to hold are also important for a range of other
applications and hence we expect this rule to hold for essentially all distributed memory
computers.

The memory rule of thumb, while sufficient, is not necessary. It is possible to achieve
efficiency on PDSYEVX on computers whose memory is smaller than that suggested by this
rule^2. In section 6.1.3 I discuss what properties a computer must have to allow efficient
execution on smaller problem sizes.

Though meeting the memory rule of thumb is not necessary to achieve high performance,
there are reasons to believe that it will be useful for several years. Software latencies
are not decreasing rapidly. Software overhead, since it is tied to main memory latency, is
not decreasing rapidly either. Bisection bandwidth and BLAS2 performance are increasing,
but not as fast as peak floating point performance.

On the other hand, improvements to PDSYEVX will make it possible to achieve high
performance with less memory and may someday make the memory rule of thumb obsolete.

2 The PARAGON is an example.

6.1.3 Performance requirements for minimum execution time

If  you  intend  to  use  as  many  processors  as  possible  to  minimize  execution  time, 
the  second  most  important  machine  characteristic  (after  peak  floating  point  rate)  is  main 
memory  speed.  Main  memory  speed  affects  three  of  the  four  sources  of  inefficiencies  in 
PDSYEVX:  message  initiation,  load  imbalance  and  software  overhead.  Message  initiation  and 
software  overhead  costs  are  controlled  by  how  long  it  takes  to  execute  a  stream  of  code 
with  little  data  or  code  locality.  Since  the  communication  software  initiation  code  offers 
little  code  or  data  locality,  its  execution  time  is  largely  dependent  on  main  memory  latency. 
Load  imbalance  consists  mainly  of  BLAS2  row  and  column  operations.  The  BLAS2  flop  rate 
is  controlled  by  main  memory  bandwidth.  Smaller  main  memory  bandwidth  also  requires  a 
larger  blocking  factor  in  order  to  achieve  peak  floating  point  performance  in  matrix  matrix 
multiply.  Larger  blocking  factors  mean  more  BLAS2  row  and  column  operations.  Hence 
reduced  main  memory  speed  has  a  double  effect  on  the  cost  of  row  and  column  operations: 
increasing  the  number  of  them  while  increasing  the  cost  per  operation. 

Caches  can  be  used  to  improve  memory  performance,  however  the  value  of  caches 
is  reduced  by  several  factors:  The  inner  loop  in  reduction  to  tridiagonal  form,  the  source 
of  most  of  the  inefficiency  in  PDSYEVX,  is  substantial  and  includes  many  subroutine  calls. 
ScaLAPACK  is  a  layered  library  which  includes  the  PBLAS,  BLAS,  BLACS  and  the  underlying 
communication  software.  The  inner  loop  in  reduction  to  tridiagonal  form  touches  every 
element  in  the  unreduced  (trailing)  part  of  the  matrix.  The  second  level  cache  is  typically 
shared  between  code  and  data.  Even  the  way  that  BLAS  routines  are  typically  coded  impacts 
the  value  of  caches  in  PDSYEVX.  The  fact  that  the  inner  loop  in  reduction  to  tridiagonal  form 
includes  many  subroutine  calls  combined  with  ScaLAPACK’s  layered  approach  means  that 
this  inner  loop  typically  involves  many  code  cache  misses.  Indeed  even  the  much  simpler 
inner loop in LU involves many code cache misses in ScaLAPACK[160]. Since this same inner
loop  touches  every  element  in  the  matrix,  the  secondary  cache,  typically  shared  by  both 
code  and  data,  will  be  completely  flushed  each  time  through  the  loop  meaning  that  code 
cache  misses  will  have  to  be  satisfied  by  main  memory. 

The way that BLAS routines are typically optimized leads to a high code cache miss
rate. BLAS routines are typically coded and optimized by timing them on a representative
set of requests[92]. Each request, however, is typically run many times and the times are
averaged. Each run may involve different data to ensure that the times represent the cost
of moving the data from main memory. However, no effort is made^3 to account for the
cost of moving the code from main memory. Hence, the code cache is a resource to which
no cost is assigned during optimization. Loop unrolling can vastly expand the code cache
requirements but it can also improve performance, at least if the code is in cache. Hence
it is likely that in optimizing BLAS codes, some loops get unrolled to the point where they
use half or more of the code cache. If two such codes are called in the same loop, code
cache misses are inevitable. The unfortunate aspect of this is that the hardware designer is
powerless to prevent it. Increasing the size of the code cache might lead to even more loop
unrolling and even worse performance.

There are two ways that hardware manufacturers could make caches more useful.
One would be to improve the way that BLAS codes are optimized to ensure that the code
cache is a recognized resource (either by measuring code cache use in each call or by having
the codes optimized on a system with smaller cache sizes than those offered to the public).
The second would be to allow a path from main memory to the register file that bypasses
the cache. In the inner loop of reduction to tridiagonal form, every element of the matrix
is touched, but there is no temporal locality and no point in moving these elements up the
cache hierarchy. If these calls to the BLAS matrix-vector multiply routine, DGEMV, could be
made to bypass the caches, these caches would remain useful in the other portions of the
code, i.e. software overhead and communication latency. Even row and column operations
would benefit because these operations involve data locality across loop iterations; this data
locality is made worthless by the fact that the loop touches every element in the matrix
each time through, but could be useful if certain DGEMV calls could be made to bypass the
caches. This would require a coordinated software and hardware effort.

Secondary caches are of little importance in determining PDSYEVX execution time
because the inner loop traverses the entire matrix without any data temporal locality within
the loop. Secondary caches are important to achieving peak matrix-matrix multiply performance,
but that is their only use in PDSYEVX. In principle, if the secondary
cache were large enough and the problem small enough, secondary cache could hold the
entire matrix and hence act as fast main memory. Unfortunately, secondary caches are
never large enough to support an efficient problem size.

3 It is difficult to account for the cost of moving the code from main memory.

I would hope that, if there are other applications like PDSYEVX that could make
efficient use of smaller faster memories, some vendor or vendors will build some machines
with smaller faster main memory. I suspect that more applications need large slow memory
than small fast memory. Indeed, PDSYEVX can work well either way. But, especially
with improvements to PDSYEVX that will allow it to achieve high performance on smaller
problem sizes, PDSYEVX could achieve impressive results on a distributed memory machine
with half the main memory now typical of distributed memory parallel computers if that
smaller main memory could be made modestly, say 20%, faster. With the out-of-core symmetric
eigensolver being developed by Ed D’Azevedo (based on my suggestion to reduce
main  memory  requirements  from  4 n2  to  |??2  by  using  symmetric  packed  storage  during 
the reduction to tridiagonal form and two passes through back transformation), the main
memory  requirements  of  PDSYEVX  will  drop  by  a  factor  of  6  to  12,  furthering  the  argument 
for  smaller,  faster  main  memory. 

As  ScaLAPACK  improves,  it  will  be  able  to  achieve  high  efficiency  on  smaller  problem 
sizes.  This  will  mean  that  the  best  machines  for  ScaLAPACK  will  have  less  memory  than 
that  suggested  by  the  memory  rule  of  thumb  at  the  top  of  this  chapter. 

6.1.4  Gang  scheduling 


A code which involves frequent synchronizations, such as reduction to tridiagonal
form, requires either dedicated use of the nodes upon which it runs or gang scheduling.
If even one node is not participating in the computation, the computation will stall at the
next synchronization point.

6.2.1  Consistent  performance  on  all  nodes 

A statically load balanced code, such as PDSYEVX, will execute only as fast as
the slowest node on which it is run. This, like the need for gang scheduling, is obvious.
Yet, occasionally nodes which have identical specifications perform differently. Kathy Yelick
noticed that some nodes of the CM5 at Berkeley were slower than others. And I have reason to
believe that at least two of the nodes on the PARAGON at the University of Tennessee at
Knoxville are slower than the others (See Table 6.3).

The people who design and maintain distributed memory parallel computers should
make sure that slow nodes are identified and marked as such or taken off-line.

Table 6.1: Performance

Column headings: message latency $\alpha$; transmission cost per word $\beta$; BLAS1 flop rate $\gamma_1$; matrix-vector multiply software overhead $\delta_2$; matrix-vector multiply flop rate $\gamma_2$; matrix-matrix multiply flop rate $\gamma_3$

IBM SP2: 54; 0.12 (67); .0037 (270); .25 (4); 5; ??; p
PARAGON: 66; 0.14 (57); 0.0235 (42); 3.8 (.26); 80; P; ??

6.3 Performance characteristics of distributed memory computers

6.3.1 PDSYEVX execution time (predicted and actual)

Table 6.3 compares predicted and actual performance on the Intel PARAGON. Actual
PDSYEVX performance never exceeds the performance predicted by our model and usually
is within 15% of the predicted performance. Every run whose actual execution time
is more than 15% greater than the expected execution time is marked with an asterisk.
I would be satisfied with a performance model that is within 20% to 25%, and would not
expect this performance model to match to within 15% on other machines. I have checked
several of the asterisked runs and have noticed that in these runs one or two processors have noticeably
slower performance on DGEMV than the other processors. I have also rerun many of these
aberrant timings and, for each that I have rerun, at least one of the runs completed within
15% of predicted performance. Nonetheless, this aberrant behavior deserves further study.

 | PARAGON MP | IBM SP2^4
Processor | 50 MHz i860 XP | 120 MHz POWER2 SC
Location | xps5.ccs.ornl.gov | chowder.ccs.utk.edu
Data cache | 16K bytes, 4-way set-associative, write-back, 32-byte lines^5 | 128K bytes
Code cache | 16 Kbytes, 4-way set-associative, 32-byte blocks | 32K bytes
Second level cache | None | None
Processors per node | 1 | 1
Memory per node | 32 Mbytes | 256 Mbytes
Operating system | Paragon OSF/1 xps5 1.0.4 P I 4 5 | AIX
ScaLAPACK | 1.5 | 1.5
BLAS | -lkmath | -lesslp2
BLACS | NX BLACS | MPL BLACS
Communication software | NX | MPI
Precision | Double, 64 bits | Double, 64 bits

Table 6.2: Hardware and software characteristics of the PARAGON and the IBM SP2.

Table 6.3: Predicted and actual execution times of PDSYEVX on xps5, an Intel PARAGON.
Problem sizes which resulted in execution time more than 15% greater than predicted
are marked with an asterisk. Many of these problem sizes which result in more than 15%
greater execution time than expected were repeated to show that the unusually large execution
times are aberrant.
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nb 
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time 
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time 
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Actual 
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375 

2 

4 

32 

8.51 

8.24 

0.97 
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4 

8 

32 

6.34 

4.65 

0.73* 
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2 

4 

32 

31.2 

30.1 

0.96 
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2 

4 

32 

31.3 
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2 

4 

32 

31.5 

30.1 

0.96 
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2 

4 

32 

41.2 

30.1 

0.73* 
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2 

4 

32 

43.3 

30.1 

0.7* 
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4 

4 

32 

20.3 
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0.93 
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4 

6 

32 
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6 

32 
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4 

6 

32 
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4 

8 

32 

14.1 
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2 

4 

32 
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53.8 

0.96 
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2 

4 

8 

52.9 

54.4 

1 
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4 

2 

32 

56.5 

54.9 

0.97 
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4 

2 

8 

56.2 

59.3 

1.1 

1125 

2 

4 

32 

72.2 

68.8 

0.95 

1125 

4 

8 

32 

38.2 

26.7 

0.7* 
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2 

4 

32 

133 

127 

0.95 
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2 

4 

32 

134 

127 

0.95 
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2 

4 

32 

134 
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0.95 
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2 

4 

32 

176 
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0.73* 
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2 

4 

32 
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127 

0.7* 

1500 

4 

4 

32 

77.2 
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0.94 
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4 

6 

32 

77 

55 
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4 

6 

32 

59.3 

55 

0.93 
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4 

6 

32 

80.9 

55 

0.68* 
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4 

8 

32 

48.6 

45.2 

0.93 

1875 

4 

8 

32 

99.7 

70.9 

0.71* 

2250 

4 

4 

32 
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175 

0.94 

2250 

4 

6 

32 
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32 
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4 

6 

32 
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4 

8 

32 
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102 

0.91 
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32 
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4 
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32 
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191 

0.89 



Chapter  7 

Execution  time  of  other  dense 
symmetric  eigensolvers 


In  this  chapter,  I  present  models  for  performance  of  other  symmetric  eigensolvers.  These 
models  have  not  been  fully  validated,  although  some  have  been  partly  validated. 

7.1  Implementations  based  on  reduction  to  tridiagonal  form 

7.1.1  PeIGs 

PeIGs[74], like PDSYEVX, uses reduction to tridiagonal form, bisection, inverse iteration and
back transformation to perform the parallel eigendecomposition of a dense symmetric matrix.
The execution time of PeIGs differs from that of PDSYEVX for two significant reasons: PeIGs
is coded differently (using a different language and different libraries) than PDSYEVX, and
it uses a different re-orthogonalization strategy. I am more interested in the difference
resulting from the different re-orthogonalization strategy.

In PDSYEVX the number of flops performed by any particular processor, $p_i$, during
re-orthogonalization is:

$$\sum_{C \in \{\text{clusters assigned to } p_i\}} 4 \sum_{i=1}^{\mathrm{size}(C)} \text{n\_iter}(i)\; n\,(i-1),$$

where n_iter(i) is the number of inverse iterations performed for eigenvalue i (typically 3).
If the size of the largest cluster is greater than $n/p$, the processor which is responsible
for this cluster will not be responsible for any eigenvalues outside of this cluster.

Hence, if the size of the largest cluster is greater than $n/p$, the number of flops
performed by the processor to which this cluster is assigned is (on average):

$$4\;\text{n\_iter}\; n\,\frac{c^2}{2} = 6\,n\,c^2,$$

where $c = \max_{C \in \{\text{clusters}\}} \mathrm{size}(C)$, i.e. the number of eigenvalues in
the largest cluster, and n_iter = 3 is the average number of inverse iterations performed for
each eigenvalue.

If the largest cluster has fewer than $n/p$ eigenvalues, the number of eigenvalues that
will be assigned to any one processor, and hence the total number of flops it must perform,
is limited. The worst case is where there are $p+1$ clusters each of size $n/(p+1)$. In this
case, one processor must be assigned 2 clusters of size $n/(p+1)$, requiring (on average)
$2 \times 6\,n\,(n/(p+1))^2$, or roughly $12\,n^3/p^2$, flops.
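The flop count above can be checked numerically. The following is a minimal sketch, not code from PDSYEVX: the function names are hypothetical, and it assumes every eigenvalue takes the same number of inverse iterations. It evaluates the exact double sum for the clusters held by one processor and compares it with the $6\,n\,c^2$ approximation.

```python
def reorth_flops(cluster_sizes, n, n_iter=3):
    """Exact re-orthogonalization flops for one processor.

    cluster_sizes: sizes of the clusters assigned to this processor.
    n: matrix dimension. n_iter: inverse iterations per eigenvalue
    (assumed constant here; the text says it is typically 3).
    """
    total = 0
    for size in cluster_sizes:
        for i in range(1, size + 1):
            # eigenvector i is re-orthogonalized against the i - 1 previous ones
            total += 4 * n_iter * n * (i - 1)
    return total


def reorth_flops_approx(c, n, n_iter=3):
    """Approximation 4 * n_iter * n * c^2 / 2 (equals 6 n c^2 when n_iter = 3)."""
    return 4 * n_iter * n * c * c / 2


if __name__ == "__main__":
    n, c = 1000, 100
    print(reorth_flops([c], n), reorth_flops_approx(c, n))
```

The exact sum is $6\,n\,c\,(c-1)$ for n_iter = 3, so the approximation overestimates by a factor of $c/(c-1)$, which is negligible for large clusters.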

In contrast, PeIGs uses multiple processors and simultaneous iteration to maintain
orthogonality among eigenvectors associated with clustered eigenvalues. Traditional inverse
iteration[102] computes one eigenvector at a time, re-orthogonalizing against all previous
eigenvectors associated with eigenvalues in the same cluster after each iteration. PeIGs,
in what its authors refer to as simultaneous iteration, performs one step of inverse
iteration on all eigenvectors associated with a cluster of eigenvalues and then
re-orthogonalizes all the eigenvectors. This allows the re-orthogonalization to be performed
efficiently in parallel.

PeIGs is more accurate but slower than PDSYEVX if the input matrix has large
clusters of eigenvalues.¹ The cost of re-orthogonalization in PeIGs is $O(n^2 c/p)$ flops
versus $O(n c^2)$ flops in PDSYEVX.

7.1.2  HJS 

Hendrickson, Jessup and Smith[91] wrote a symmetric eigensolver, HJS, for the PARAGON
which is significantly faster than PDSYEVX, but which has never been released and only
works on the Intel PARAGON.

HJS requires that the data layout block size be 1, i.e. a cyclic distribution, that
the processor grid be square, i.e. $p_r = p_c$, and that intermediate matrices be replicated
across processor columns and distributed across processor rows. The requirement that the
processor grid be square limits efficiency when used on a non-square processor grid. They
show that the algorithmic block size need not be tied to the data layout block size. At the
time that PDSYTRD was written, the PBLAS could not efficiently use a cyclic distribution and
did not support matrices replicated in one processor dimension and distributed across the
other.

¹ PDSYEVX can maintain orthogonality among eigenvectors associated with clusters of up to
$n/p$ eigenvalues easily and efficiently.

HJS has several advantages over PDSYEVX. It uses a more efficient transpose operation,
eliminates redundant communication, reduces the number of messages by combining
some, and reduces the number of words transmitted per process by using recursive halving
and doubling. HJS also reduces the load imbalance by a factor of $\sqrt{p}$ by using a cyclic
data layout and using all processors in all calculations.² ScaLAPACK will incorporate several
of these ideas into the next version of PDSYEVX.

HJS  notation 

HJS also differs in a couple of other rather minor aspects. They compute the norm
of v in a manner which could overflow, and they represent the reflector in a manner that
could likewise overflow. These choices reduce execution time and program complexity slightly.
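To illustrate the trade-off between the two ways of computing a norm, here is a minimal sketch. It is not the code used by either HJS or PDSYTRD; the function names are hypothetical, and the scaled variant is only one common way to guard against overflow.

```python
import math


def naive_norm(x):
    """Square root of the sum of squares; the squares can overflow to inf."""
    return math.sqrt(sum(xi * xi for xi in x))


def scaled_norm(x):
    """Overflow-resistant norm: scale by the largest magnitude first.

    Costs an extra pass over the data and a division per element,
    which is the kind of extra work the cheaper variant avoids.
    """
    scale = max(abs(xi) for xi in x)
    if scale == 0.0:
        return 0.0
    return scale * math.sqrt(sum((xi / scale) ** 2 for xi in x))


if __name__ == "__main__":
    x = [3e200, 4e200]
    print(naive_norm(x))   # the squares exceed the double precision range
    print(scaled_norm(x))  # stays finite
```

The naive form is shorter and faster, and it is safe whenever the entries are far from the overflow threshold, which is the risk that the cheaper approach accepts.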

Their manner of counting the cost of messages in their performance model also differs
from ours. They count the cost of a message swap (sending a message to and simultaneously
receiving a message from another processor) as equal to the cost of sending a single
message. This reflects reality on the PARAGON and on many, but not all, distributed memory
machines. Using their method would not significantly change the model for PDSYEVX because
PDSYEVX does not use message swap operations.

In their paper[91], they use different variable names for the result of each computation
and show all indices explicitly. Figure 7.1 relates their notation to ours.

Figure 7.1: HJS notation

² PDSYTRD uses only $p_r$ processors in many computations.

7.1.3 Comparing the execution time of HJS to PDSYEVX

The HJS implementation of parallel blocked Householder tridiagonalization performs
essentially the same computation as PDSYEVX. The difference is in the communication, load
balance and overhead costs. However, the operations are not performed in the same order,
and hence the steps don't match exactly. Some of the costs, particularly communication
costs, could easily have been assigned to a different operation than the one that I assigned
them to. Hence, the execution time models for each of the individual tasks should not be
taken in isolation but understood as an aid in understanding the total.


Updating  the  current  column  of  A  (Line  1.1  in  Figure  7.2) 


As shown in table 4.1, the cost of updating the current column of A in PDSYTRD is:

$$2\,n\,\lceil\log_2(\sqrt{p})\rceil\,\alpha + n\,\mathit{nb}\,\lceil\log_2(\sqrt{p})\rceil\,\beta + 2\,n\,\delta_2 + \frac{n^2\,\mathit{nb}}{\sqrt{p}}\,\gamma_2 + 2\,n\,\delta_4 .$$

In Figure 6[91] steps Y2, 10.1, 10.2 and 10.3 of HJS are involved in updating the current
column of A and the cost of these steps is:

$$n\,\alpha + \frac{1}{2}\,\frac{n^2}{\sqrt{p}}\,\beta + 2\,n\,\delta_2 + \frac{n^2\,\mathit{nb}}{p}\,\gamma_2 .$$

In PDSYEVX, a small part of $v^T$ and $w^T$ must be broadcast within the current column
of processors. In HJS, there is no need to broadcast $v^T$ because it is already replicated
across all processor rows. Instead of broadcasting the piece of $w^T$ that is necessary for
this update, HJS transposes all of $w^T$ (cost: $n\,\alpha + \frac{1}{2}\,(n^2/\sqrt{p})\,\beta$),
anticipating the need for this in the rank 2k update.

The number of DGEMV flops performed does not change, but they are distributed
across all of the processors instead of being shared only by one column of processors. In
order to allow these flops to be distributed across all the processors, this update is
performed in a right-looking manner, i.e. the entire block column of the remaining matrix is
updated with the Householder reflector. In PDSYEVX, this update is performed in a
left-looking manner: only the current column is updated (with a matrix-vector multiply). In
PDSYEVX, the right-looking variant does not spread the work any better, and hence the
left-looking variant is preferred because it involves a matrix-vector multiply, DGEMV,
rather than a rank-one update, DGER. Matrix-vector multiply requires only that every matrix
element be read. A rank-one update requires that every matrix element be read and then
re-written.

The $\delta_4$ term does not exist for HJS because they do not use the PBLAS, avoiding
the error checking and overhead associated with the PBLAS.
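The two cost expressions for updating the current column can be compared term by term. The sketch below assumes the models read as follows: PDSYTRD costs $2n\lceil\log_2\sqrt{p}\rceil\alpha + n\,nb\,\lceil\log_2\sqrt{p}\rceil\beta + 2n\delta_2 + (n^2 nb/\sqrt{p})\gamma_2 + 2n\delta_4$ and HJS costs $n\alpha + \frac{1}{2}(n^2/\sqrt{p})\beta + 2n\delta_2 + (n^2 nb/p)\gamma_2$. The function names are hypothetical, and the machine parameters in the usage example are placeholders chosen for illustration, not measured values.

```python
import math


def pdsytrd_update_column(n, p, nb, alpha, beta, gamma2, delta2, delta4):
    """Assumed PDSYTRD model for updating the current column of A."""
    lg = math.ceil(math.log2(math.sqrt(p)))
    return (2 * n * lg * alpha
            + n * nb * lg * beta
            + 2 * n * delta2
            + n * n * nb / math.sqrt(p) * gamma2
            + 2 * n * delta4)


def hjs_update_column(n, p, nb, alpha, beta, gamma2, delta2):
    """Assumed HJS model for the same task (no PBLAS overhead term)."""
    return (n * alpha
            + 0.5 * n * n / math.sqrt(p) * beta
            + 2 * n * delta2
            + n * n * nb / p * gamma2)


if __name__ == "__main__":
    # Placeholder machine parameters (seconds); purely illustrative.
    params = dict(alpha=50e-6, beta=0.1e-6, gamma2=0.05e-6, delta2=5e-6)
    n, p, nb = 4000, 64, 32
    print(pdsytrd_update_column(n, p, nb, delta4=100e-6, **params))
    print(hjs_update_column(n, p, nb, **params))
```

With only the flop term switched on, the ratio of the two models is $\sqrt{p}$, which reflects the flops being spread over all processors rather than one processor column.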

Computing  the  reflector  (Line  2.1  in  Figure  7.2) 

As shown in table 4.2, the cost in PDSYTRD is: $3\,n\,\lceil\log_2(p_r)\rceil\,\alpha + n\,\delta_4$.

In Figure 6[91] steps 2, 3, 4, 5, 6 and A” of HJS are involved in computing the reflector, and
the cost of these steps is: $n\,\lceil\log_2(p)\rceil\,\alpha$, a little less than the cost in
PDSYTRD.

Step 1 in HJS is also used in the computation of the reflector. However, step 1 isn't
necessary to compute the reflector, while it is necessary for the matrix-vector multiply,
hence I assign the cost of Step 1 to the matrix-vector multiply.

Both routines perform essentially the same operations. HJS appends the broadcast
of $A(J+1, J)$ to the computation of xnorm (though HJS actually computes xnorm²), which
HJS performs as a sum-to-all. On the other hand, they involve all processors rather than
just one column of processors, hence the sum costs $\lceil\log_2(p)\rceil$ rather than
$\lceil\log_2(p_r)\rceil$.

The difference in performance would appear more dramatic if I included the cost
of the BLAS1 operations in my PDSYEVX model. I do not because they account for an
insignificant $O(n^2/\sqrt{p})\,\gamma_1$ execution time. HJS performs fewer BLAS1 flops
(because they do not go to the extremes that PDSYTRD does to avoid overflow) and the flops
that they perform are distributed over all processors instead of over only one column of
processors.


The cost of matrix vector multiply (Lines 3.1-3.6 in Figure 7.2)


As  shown  in  table  4.3,  the  cost  of  the  matrix  vector  multiply  in  PDSYTRD  is: 


4  n  flog2(  x/p)]  a +2  n  a +2 


n2riog2(v/P)l 
Vp 

2  n3  _  n2  nb  n2 

-7 —  72+3  — —  72H — -p  <14 

3  P  s/P  s/P 


n 


n 


/HI. 5  —  /3+ 2  n  nb  flog2(  s/p)]  H 

VP  nb 


So 


In Figure 6[91] steps 1, Y1, 7.1, 7.2, and 7.3 are involved in matrix vector multiply and the
cost of these steps is:

$$2\,n\,\lceil\log_2(\sqrt{p})\rceil\,\alpha + 2\,n\,\alpha + \frac{1}{2}\,\frac{n^2}{\sqrt{p}}\,\lceil\log_2(\sqrt{p})\rceil\,\beta + \frac{3}{2}\,\frac{n^2}{\sqrt{p}}\,\beta + 2\,n\,\delta_2 + \frac{2}{3}\,\frac{n^3}{p}\,\gamma_2 .$$

The model for HJS is much simpler because 1) the local portion of the matrix-vector multiply
requires just a single call to DGEMV and 2) the load imbalance in HJS is negligible compared
with that in PDSYEVX.




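The communication performed in HJS during the matrix vector multiply includes:

Operation                   Figure 6[91]     Execution time model
Broadcast v within a row    Step 1           $n\,\lceil\log_2(\sqrt{p})\rceil\,\alpha + \frac{1}{2}\,(n^2/\sqrt{p})\,\lceil\log_2(\sqrt{p})\rceil\,\beta$
Transpose v and y           Steps Y1, 7.3    $2\,n\,\alpha + (n^2/\sqrt{p})\,\beta$
Recursive halve p           Step 7.3         $n\,\lceil\log_2(\sqrt{p})\rceil\,\alpha + \frac{1}{2}\,(n^2/\sqrt{p})\,\beta$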

The transpose operations take advantage of the fact that $p_r = p_c$. Each processor
(a, b) simply sends its local portion of the vector to processor (b, a) while receiving the
transpose from that same processor.

The recursive halving operation is a distributed sum in which each of the $p_c$
processors in the row starts with $k$ values and ends up with $k/p_c$ sums.
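The recursive halving pattern can be sketched as a serial simulation. This is an illustration of the general technique rather than the HJS code: the function name is hypothetical, the processor count is assumed to be a power of two, and the vector length is assumed to be divisible by it.

```python
def recursive_halving(vectors):
    """Simulate a recursive-halving distributed sum.

    vectors[r] is the list of k values initially held by processor r.
    Returns (pieces, words_sent), where pieces[r] = (lo, hi, sums) is the
    slice of the global sum that processor r ends up owning, and
    words_sent[r] counts the values that processor r sent.
    """
    pc = len(vectors)
    k = len(vectors[0])
    assert pc & (pc - 1) == 0 and k % pc == 0  # simplifying assumptions
    data = [list(v) for v in vectors]
    lo = [0] * pc
    hi = [k] * pc
    words_sent = [0] * pc
    dist = pc // 2
    while dist >= 1:
        new_data = [row[:] for row in data]
        for r in range(pc):
            partner = r ^ dist
            mid = (lo[r] + hi[r]) // 2
            # One partner keeps the lower half of the range, the other the upper.
            keep_lo, keep_hi = (lo[r], mid) if r & dist == 0 else (mid, hi[r])
            # The half that is not kept is sent to the partner.
            words_sent[r] += (hi[r] - lo[r]) - (keep_hi - keep_lo)
            for j in range(keep_lo, keep_hi):
                new_data[r][j] = data[r][j] + data[partner][j]
            lo[r], hi[r] = keep_lo, keep_hi
        data = new_data
        dist //= 2
    pieces = [(lo[r], hi[r], data[r][lo[r]:hi[r]]) for r in range(pc)]
    return pieces, words_sent


if __name__ == "__main__":
    vecs = [[q * 10 + j for j in range(8)] for q in range(4)]
    print(recursive_halving(vecs))
```

Each processor sends $k/2 + k/4 + \dots = k\,(1 - 1/p_c)$ values in $\log_2 p_c$ message swaps, instead of the $k \log_2 p_c$ values that a tree-based sum of full vectors would move.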

Updating  the  matrix  vector  multiply  (Line  4.1  in  Figure  7.2) 

As  shown  in  table  4.4,  the  cost  of  updating  the  matrix  vector  multiply  in  PDSYTRD  is: 

6  n  flog2(  y/p) ]  a+n  n  nb  ^og2(  /j+ 4  n  <’)2+2  72+4  n  . 

Pr  y P 

In Figure 6[91] step 7.4 updates the matrix vector multiply and the cost of this step is:

$$2\,n\,\delta_2 + \frac{n^2\,\mathit{nb}}{p}\,\gamma_2 .$$

Computing  the  companion  update  vector,  w  (Line  5.1  in  Figure  7.2) 

As  shown  in  table  4.5,  the  cost  of  computing  the  companion  update  vector  in  PDSYTRD  is: 

2n  flog2(  x/p)]a+nS4+  . 

In Figure 6[91] steps 8 and 9 compute the companion update vector and the cost of these
steps  is: 

i/- 

5  n  [log2(  y/p)]  a  +  —  /3  . 

Just as in the computation of the reflector, the $O(n^2)$ cost of the BLAS1 operations
is insignificant. HJS performs these more efficiently than PDSYEVX because it uses all the
processors in these computations.


Perform the rank 2k update (Line 6.3 in Figure 7.2)

As  shown  in  table  4.6,  the  cost  of  the  rank  2 k  update  in  PDSYTRD  is: 
9  ?

71  71  71 

4^b  n°g2(  v^)l  [log2(  v^)l  li+—/3-2nnb  flog2(  >/p)1  / 3 


n2  r  2  n2  nb 

+4  2  S3+-  73+3  — —  73 

nb  y/ppbf  3  yjp 


In Figure 6[91] step 10.4 performs the rank 2k update and the cost of this step is:


n2  2 

nb2  sJ7>  3  f 


HJS does not require any communication here because $W$ and $V$ are already
replicated across the processor rows, while $W^T$ and $V^T$ are already replicated across all
the processor columns.

Both HJS and PDSYEVX must perform the rank 2k update as a series of panel
updates using DGEMM. Both PDSYTRD and HJS use a panel width of twice the algorithmic
blocking factor.

Figure  7.2  summarizes  the  main  sources  of  inefficiencies  in  HJS  reduction  to  tridi¬ 
agonal  form. 

Table 7.1 compares the execution time in PDSYEVX and HJS reduction to tridiagonal
form. Each row represents a particular operation. The second column is the time (in
seconds) associated with the given operation in PDSYEVX. The third column shows the
number of times the given operation is performed in PDSYEVX. The product of the third column
with the first column, after substituting the cost for the operation given in section 5.2
and n = 4000 and p = 64, is the second column. For example, the cost of matrix multiply flops
in PDSYTRD on the PARAGON is $\frac{2}{3}\,(n^3/p)\,\gamma_3$, which for n = 4000, p = 64 and
$\gamma_3$ = .0215e-6 gives 14.3. Likewise, the second to last column (the number of times
the given operation is performed in reduction to tridiagonal form in HJS) times the first
column equals the last column (the time associated with the given operation in reduction to
tridiagonal form in HJS).
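The worked example in the text can be reproduced directly. This short sketch uses only the values quoted above (n = 4000, p = 64, and a matrix multiply flop time of .0215e-6 seconds); the function name is hypothetical.

```python
def matrix_multiply_flop_time(n, p, gamma3):
    """Time for the (2/3) n^3 / p matrix multiply flops on one processor."""
    return (2.0 / 3.0) * n ** 3 / p * gamma3


if __name__ == "__main__":
    # Values quoted in the text for PDSYTRD on the PARAGON.
    print(round(matrix_multiply_flop_time(4000, 64, 0.0215e-6), 1))
```

The same pattern (operation count times per-operation cost) gives every other time entry in the table.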

Columns 4 through 10 represent unimplemented intermediate variations on reduction
to tridiagonal form. Column 4, labeled "minus PBLAS inefficiencies", assumes that a
couple of inefficiencies of the PBLAS are removed (a bug in the PBLAS causing unnecessary
communication, and the PBLAS overhead). Column 5, labeled "be less paranoid", assumes
that in addition PDSYTRD computes reflectors in the slightly faster, slightly riskier manner
that HJS does. Column 6 assumes direct transpose operations. Column 7 assumes that
certain messages are combined, reducing the message latency cost. Column 8 assumes that
sum-to-all is used instead of sum-to-one followed by a broadcast, reducing the latency cost.
Column 9 assumes that $V$, $W$, $V^T$, $W^T$ are stored replicated across processor columns;
this eliminates all communication in the rank 2k update. Storing the data replicated also
allows all processors to be involved in all computations, but this is not assumed until
column 11. Column 10 assumes a cyclic data layout, eliminating some load imbalance. Column 11
assumes that all processors are involved in all computations, eliminating the load imbalance
which was not eliminated by using a cyclic data layout.

Figure 7.2: Execution time model for HJS reduction to tridiagonal form. Line numbers
match Figure 4.5 (PDSYEVX execution time). Cost columns: computation (overhead,
imbalance) and communication (latency, bandwidth).

do ii = 1, n, nb
    mxi = min(ii + nb, n)
    do i = ii, mxi
        Update current (ith) column of A
            1.1  transpose w
            1.2  A = A - W V^T - V W^T
        Compute reflector
            2.1  v = house(A)
        Perform matrix-vector multiply
            3.1  spread v across
            3.2  transpose v
            3.3  w = tril(A) v                        (2/3) (n^3/p) gamma_2
            3.4  w^T = tril(A, -1) v^T
            3.5  recursive halve w
            3.6  w = w + transpose w^T
        Update the matrix-vector product
            4.1  w = w - W V^T v - V W^T v
        Compute companion update vector
            5.1  c = w . v^T
                 w = tau w - (c tau / 2) v
    end do i = ii, mxi
    Perform rank 2k update
        6.3  A = A - W V^T - V W^T                    (2/3) (n^3/p) gamma_3
end do ii = 1, n, nb

7.1.4  PDSYEV 

PDSYEV uses the QR algorithm to solve the tridiagonal eigenproblem. Each eigenvector is
spread evenly among all the processors. Each processor redundantly computes the rotations
and updates the portion of each eigenvector which it owns. Computing the rotations requires
$O(n^2)$ flops, whereas updating the eigenvectors requires $O(n^3)$ flops. Hence PDSYEV scales
reasonably well as long as all the eigenvectors are required.

Each rotation requires 2 divides, 1 square root and approximately 20 flops to compute
and 6 flops to apply.

The cost of the QR based tridiagonal eigensolution in PDSYEV is:

$$\sum_{j=1}^{n} \mathrm{sweeps}(j)\,(n-j)\left(2\,\gamma_{\div} + \gamma_{\surd} + 20\,\gamma_1 + \frac{1}{p}\,6\,n\,\gamma_1\right)$$

On average, it takes two sweeps per eigenvalue, so we set sweeps(j) = 2 and simplify:

$$2\,n^2\,\gamma_{\div} + 1\,n^2\,\gamma_{\surd} + 20\,n^2\,\gamma_1 + 6\,\frac{n^3}{p}\,\gamma_1$$
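The simplification can be checked by evaluating both forms of the model. The sketch below is an illustration only: the function names are hypothetical, and the per-operation times in the usage example (for a divide, a square root and a BLAS1 flop) are placeholder values, not measurements.

```python
def qr_cost_sum(n, p, g_div, g_sqrt, g1, sweeps=2):
    """Summation form: each rotation costs 2 divides, 1 square root and
    about 20 flops to compute, plus 6 n / p flops to apply locally."""
    per_rotation = 2 * g_div + g_sqrt + 20 * g1 + 6 * n * g1 / p
    return sum(sweeps * (n - j) * per_rotation for j in range(1, n + 1))


def qr_cost_simplified(n, p, g_div, g_sqrt, g1):
    """Simplified form with sweeps(j) = 2 for every eigenvalue."""
    return (2 * n ** 2 * g_div + n ** 2 * g_sqrt
            + 20 * n ** 2 * g1 + 6 * n ** 3 * g1 / p)


if __name__ == "__main__":
    # Placeholder timings in seconds; purely illustrative.
    args = dict(n=1000, p=64, g_div=1e-6, g_sqrt=2e-6, g1=0.1e-6)
    print(qr_cost_sum(**args), qr_cost_simplified(**args))
```

The two forms differ by the factor $(n-1)/n$, since $\sum_{j=1}^{n} 2\,(n-j) = n\,(n-1)$, so the simplified expression is accurate for any realistic n. For large n the $6\,(n^3/p)\,\gamma_1$ eigenvector update dominates, which is why the method scales with p.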

7.2  Other  techniques 

7.2.1  One  dimensional  data  layouts 

One dimensional data layouts can improve the performance of dense linear algebra codes
on modest numbers of processors, especially on one-sided reductions like LU and QR
decomposition. In general, one dimensional data layouts require fewer communication calls
in the inner loop but more words transmitted per process. One-sided reductions typically
require fewer messages within rows than within columns, sometimes by a factor as high as
nb; other times the advantage is a more modest $\log(\sqrt{p})$. One-sided reductions often
require fewer words to be transmitted between columns than between rows of processors,
usually by a factor of nb.

One dimensional data layouts also offer less overhead. Often an entire block column
can be computed by a call to the corresponding LAPACK code rather than the ScaLAPACK
code, saving significant overhead costs.
Table 7.1: Comparison between the cost of HJS reduction to tridiagonal form and PDSYTRD
on n = 4000, p = 64, nb = 32. Values differing from the previous column are shaded.

Both LU decomposition and back transformation would benefit considerably from
one-dimensional data layouts when p is small, although the advantage would be most
pronounced on LU. One-sided reductions require $O(n)$ reductions across processor rows but
only $O(n/\mathit{nb})$ reductions across processor columns. On a high latency system such as
a network of workstations, the performance improvement from using a one-dimensional data
layout could be substantial since LU requires $O(\mathit{nb})$ fewer messages on a
one-dimensional data layout.

ScaLAPACK  does  not  take  full  advantage  of  one- dimensional  data  layouts  because 
it  calls  the  ScaLAPACK  code  even  when  the  LAPACK  code  would  do  the  job  faster. 

Two-sided reductions, such as reduction to tridiagonal form, do not benefit from
one dimensional data layouts. Two-sided reductions require $O(n)$ reductions across
processor rows and $O(n)$ reductions across processor columns, hence eliminating the
reductions across processor rows (by using a 1D data decomposition) will not substantially
reduce the number of messages in two-sided reductions.

7.2.2  Unblocked  reduction  to  tridiagonal  form 

Unblocked reduction to tridiagonal form can outperform blocked reduction for small and
modest sized problems, especially if a good compiler is available for the inner kernel.
Unblocked reduction to tridiagonal form must perform all of its flops as BLAS2 flops, whereas
blocked reduction to tridiagonal form performs half of its flops as BLAS3 flops. However,
unblocked reduction to tridiagonal form requires much less overhead. Blocked reduction
to tridiagonal form requires at least 6n calls to DGEMV; unblocked reduction to tridiagonal
form requires only n calls to DSYMV and n calls to DGER.

If a compiler is available that will efficiently compile the following kernel, unblocked
reduction to tridiagonal form could require only n BLAS2 calls and still attain near peak
performance on large problem sizes, especially for Hermitian eigenproblems.³ The kernel
shown below only requires each element of A to be read once and written once, while
performing 8 flops. This ratio, 1 memory read and 1 memory write to 8 flops, is one that many
modern computers can handle at near peak speed, even from main memory, in part because the
accesses are essentially all stride 1.

³ Complex arithmetic requires only half as much memory traffic per flop.

    for i = 1, n {
        for j = 1, i {
            A(i,j) = A(i,j) - v(i) * wt(j) - w(i) * vt(j);
            nwt(i) = nwt(i) + A(i,j) * nv(j);
            nw(j) = nw(j) + A(i,j) * nvt(i);
        }
    }
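The fused kernel can be written out as a runnable sketch. The version below is a direct, zero-based Python transcription meant only to make the data flow concrete; it says nothing about performance, and the function name is hypothetical. It applies the rank-2 update to the lower triangle of A and, in the same pass, accumulates the two matrix-vector products for the next step.

```python
def fused_update_and_multiply(A, v, w, vt, wt, nv, nvt):
    """Update the lower triangle of A in place and accumulate products.

    A is a list of rows; only entries with j <= i are touched.
    Each A[i][j] is read once and written once, with 8 flops per entry:
    4 for the rank-2 update and 2 for each of the two accumulations.
    Returns (nw, nwt).
    """
    n = len(A)
    nw = [0.0] * n
    nwt = [0.0] * n
    for i in range(n):
        for j in range(i + 1):
            A[i][j] = A[i][j] - v[i] * wt[j] - w[i] * vt[j]
            nwt[i] += A[i][j] * nv[j]   # row i of the updated triangle times nv
            nw[j] += A[i][j] * nvt[i]   # column j of the updated triangle times nvt
    return nw, nwt


if __name__ == "__main__":
    A = [[4.0, 0.0, 0.0], [1.0, 5.0, 0.0], [2.0, 3.0, 6.0]]
    v = [1.0, 2.0, 3.0]
    w = [2.0, 1.0, 0.0]
    print(fused_update_and_multiply(A, v, w, v, w, [1.0, 1.0, 1.0], [1.0, 2.0, 3.0]))
```

An unfused implementation would sweep over A twice, once for the update and once for the multiply; fusing the loops is what brings the memory traffic down to one read and one write per element.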

7.2.3  Reduction  to  banded  form 

Reducing a dense matrix to banded form can be more efficient than reduction to tridiagonal
form[24, 25, 116]; however, it is not clear that this can be made fast enough to overcome
the added costs to the rest of the code. Reduction to banded form requires less execution
time than reduction to tridiagonal form because it requires fewer messages, $O(n/\mathit{nb})$
instead of $O(n)$, and because asymptotically all of the flops can be performed as BLAS3
flops rather than half as BLAS2 flops.

An efficient eigensolver based on reduction to banded form could be designed as follows:

    1. Reduce to banded form
    2. Reduce from banded form to tridiagonal form (do not save rotations)
    3. Compute eigenvalues using bisection on tridiagonal form
    4. Perform inverse iteration on banded form
    5. Back transform the eigenvectors


This  would  be  even  simpler  if  only  eigenvalues  were  required,  as  that  eliminates  the  inverse 
iteration  and  back  transformation  steps. 

If only a few eigenvectors are required, one could reduce from banded form to
tridiagonal form, saving the rotations. This would allow the eigenvectors to be computed
on the tridiagonal using inverse iteration (or the new Parlett/Dhillon work). Then the
rotations could be applied as necessary and finally the eigenvectors would be transformed
back. This would result in a complex code.

If two step band reduction to tridiagonal form were performed as above and the
eigenvectors were computed on the tridiagonal matrix, the cost of transforming them back to
the original problem would be at least $4n^3$, adding 60% more $O(n^3)$ flops to full
tridiagonal eigendecomposition. This could be done in two steps, applying first the rotations
accrued during reduction from banded to tridiagonal form and then transforming the
eigenvectors of the banded form back to the original problem. A cleaner, though more costly,
solution would be to form the back transformation matrix after (or during) reduction to
banded form, update that during reduction from banded to tridiagonal form and then use this
to transform the eigenvectors of the tridiagonal back to the original problem.

Using reduction to banded form in an eigensolver requires, at a minimum, that
two step band reduction to tridiagonal form be faster than direct reduction to tridiagonal
form. If eigenvectors are required, it must be significantly faster in order to overcome the
additional $2n^3$ cost of back transformation.
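The 60% figure follows from simple arithmetic, sketched here under the assumption that direct tridiagonal eigendecomposition costs the usual $\frac{4}{3}n^3$ flops for the reduction plus $2n^3$ flops for back transformation:

$$\underbrace{\tfrac{4}{3}n^3}_{\text{reduction}} + \underbrace{2n^3}_{\text{back transformation}} = \tfrac{10}{3}n^3, \qquad \frac{4n^3 - 2n^3}{\tfrac{10}{3}n^3} = \frac{2n^3}{\tfrac{10}{3}n^3} = 0.6 .$$

Raising the back transformation cost from $2n^3$ to $4n^3$ therefore adds $2n^3$ flops, or 60% of the $O(n^3)$ work in the direct approach.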

So far, no one has demonstrated that two step reduction to tridiagonal form can
be performed faster than direct reduction on distributed memory computers. Alpatov,
Bischof and van de Geijn's two-step reduction to tridiagonal form[173] is not faster than
PDSYTRD. They assert that it can be optimized, but that is also true of PDSYTRD. So, it
is not yet clear whether two-step reduction to tridiagonal form will be significantly faster
than direct reduction to tridiagonal form on any important subset of distributed memory
parallel computers.

I  believe  that  software  overhead  plays  a  significant  role  in  limiting  the  performance 
of  two  step  reduction  to  banded  form. 

7.2.4  One-sided  reduction  to  tridiagonal  form 

Hegland et al.[90] show that one can reduce the Cholesky factor (of a shifted input
matrix) to bidiagonal form updating from only one side. The result, in their implementation,
is a code which requires $\frac{10}{3}\,n^3/p + n^2 p$ flops per processor, $n^2 p$ words
communicated per processor and $n\,p$ messages per processor.

They argue that this technique, despite requiring 2.5 times as many flops, yields
better performance on their target machine than conventional methods for reduction to
tridiagonal form. They use a 1D processor grid, an unblocked algorithm and a non-scalable
pattern of communication and computation, and they ignore symmetry. By ignoring much of the
conventional wisdom they have achieved a simple, high performance code for their target
machine (a vector machine).
7.2.5 Strassen's matrix multiply

The number of flops in Strassen's matrix-matrix multiply is:

$$2\,m\,n\,k \left(\frac{\min(m,n,k)}{s_{1/2}}\right)^{\log_2 7 - 3}$$

where $s_{1/2}$ is the break-even point for a particular Strassen's implementation, i.e.
the point at which one additional Strassen's divide and conquer step neither increases nor
decreases execution time. Three factors contribute to preventing the use of Strassen's in
reduction to tridiagonal form and back transformation:

$s_{1/2}$ is still too large

    Lederman et al.[96] have reduced $s_{1/2}$ to the range 100 to 500.

k is modest (where k is the block size)

    We can increase the block size, but only at the cost of additional load imbalance.

$n^{\log_2 7 - 3} = n^{-.193}$ shrinks slowly

    Increasing n by enough to improve the ratio of "Strassen flops" to standard matrix
    multiply flops by 50% requires a thousand-fold increase in the amount of memory
    required. ($32^{-.193} \approx .5$, hence n must increase by a factor of 32 to improve
    the ratio of Strassen flops to standard matrix multiply flops by this much.) Improving
    the ratio of "Strassen flops" to standard matrix multiply flops by increasing the number
    of processors involved is even more difficult. Although Chou et al.[43] have shown that
    $7^k$ processors can be used to do the work of $8^k$, it takes $7^5 = 16807$ processors
    to get a factor of two advantage this way. ($7^5/8^5 = .51$)

It is this last point that prevents Strassen's from rescuing ISDA (which is described below).
Because $32^{-.193} \approx 0.5$, the problem size must be $32\,s_{1/2}$ in order to halve the
number of flops required in ISDA. Halving the number of flops again would require that n be
increased by another factor of 30, increasing memory by another factor of 900 and the total
number of flops, even after the factor of two savings, by $\frac{1}{2}\,30^3 = 13{,}500$. I
have not yet seen a Strassen's matrix-matrix multiply that achieves twice the performance of
a regular matrix-matrix multiply.
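The slow decay of the Strassen flop ratio can be evaluated numerically. The sketch below assumes the flop model $2mnk\,(\min(m,n,k)/s_{1/2})^{\log_2 7 - 3}$; the function names are hypothetical, and falling back to the standard count below the break-even point is an assumption made here, not something the model states.

```python
import math

# log2(7) - 3 is about -0.193
STRASSEN_EXPONENT = math.log2(7) - 3


def strassen_flops(m, n, k, s_half):
    """Flop count under the assumed model.

    s_half is the break-even dimension. Below it, the standard 2 m n k
    count is returned (an assumption made for this sketch).
    """
    smallest = min(m, n, k)
    if smallest <= s_half:
        return 2.0 * m * n * k
    return 2.0 * m * n * k * (smallest / s_half) ** STRASSEN_EXPONENT


def flop_ratio(n, s_half):
    """Ratio of Strassen flops to standard flops for a square problem."""
    return strassen_flops(n, n, n, s_half) / (2.0 * n ** 3)


if __name__ == "__main__":
    s_half = 100  # lower end of the 100 to 500 range quoted in the text
    for factor in (1, 2, 8, 32, 1024):
        print(factor, flop_ratio(factor * s_half, s_half))
```

At $n = 32\,s_{1/2}$ the ratio equals $7^5/8^5$, the same factor obtained from five levels of recursion, which is why halving the flop count takes a 32-fold increase in n and roughly a thousand-fold increase in memory.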
Table 7.2: Fastest eigendecomposition method

                                           $n > 500\sqrt{p}$                $n < 500\sqrt{p}$
Random matrices                            Tridiagonal (> 4 times faster)   Tridiagonal
Spectrally diagonally dominant matrices    Tridiagonal                      Jacobi

7.3  Jacobi 

7.3.1  Jacobi  versus  Tridiagonal  eigensolvers 

This  section  is  based  on  models  that  have  only  been  informally  validated.  I  have  compared 
my  models  to  those  used  by  Arbenz  and  Slapnicar[9]  and  Littlefield  and  Maschhoff[125]  as 
well  as  against  the  execution  times  reported  in  these  papers  but  have  not  performed  any 
independent  validation.  Hence,  the  opinions  that  I  express  in  this  section  should  be  taken 
as  conjectures. 

Large matrices⁴ can be solved faster by a tridiagonal based eigensolver than by a
Jacobi eigensolver, but it is likely that Jacobi will outperform tridiagonal based
eigensolvers on small spectrally diagonally dominant matrices.⁵ Since tridiagonal based
methods require, asymptotically, no more than a quarter as many flops as blocked Jacobi
methods, even on spectrally diagonally dominant matrices, I expect that tridiagonal based
methods will win on large matrices, even spectrally diagonally dominant ones, because
tridiagonal based methods can achieve 25% of peak performance on large matrices as shown in
Chapter 5. I also expect that tridiagonal based eigensolvers will beat Jacobi eigensolvers
on random matrices regardless of their size, because on random matrices tridiagonal
eigensolvers perform roughly 16 times fewer flops⁶ and I don't think that Jacobi methods
will be 16 times faster per flop regardless of the input size. Table 7.2 summarizes which
eigensolution method I expect to be faster as a function of these input matrix
characteristics.

⁴ On current machines, $n > 500\sqrt{p}$ is sufficiently large to allow a tridiagonal
eigensolver to outperform Jacobi.

⁵ Spectrally diagonally dominant means that the eigenvector matrix, or a permutation
thereof, is diagonally dominant. Most, but not all, diagonally dominant matrices are
spectrally diagonally dominant. For example, if you take a dense matrix with elements
randomly chosen from [-1, 1] and scale the diagonal elements by 1e3, the resulting diagonally
dominant matrix will generally be spectrally diagonally dominant. However, if you take that
same matrix and add 1e3 to each diagonal element, the eigenvector matrix is unchanged even
though the matrix is clearly diagonally dominant.

⁶ Assuming Jacobi converges in an optimistic 8 sweeps.
7.3.2 Overview of Jacobi Methods

Despite Jacobi's simplicity there are several possible variants, especially for a
parallel code, each of which has advantages. In section 7.3.16 I describe the code that I
would write if I were going to write a parallel code. I recommend a 2D data layout if one
wishes to be able to run efficiently on large numbers of processors (say 48 or more).
However, a 1D data layout is considerably simpler to implement, and simpler implementation
translates into less software overhead. On some computers, Jacobi with a 1D data layout
might be efficient for hundreds of processors. I recommend using a one-sided, blocked,
non-threshold Jacobi[9] with a caterpillar track pairing[150] and distinct communication and
computation phases, but other methods cannot be entirely rejected. For a spectrally
diagonally dominant matrix the fastest serial Jacobi algorithm is a threshold Jacobi, hence
threshold methods cannot be ignored. A threshold method would almost certainly have to be
two-sided, use a different pairing strategy and use either a non-blocked code or some
unconventional blocking strategy. Non-blocked codes may make sense for small matrices and
large numbers of processors as well as for machines, such as vector architectures, which
offer comparable BLAS1 and BLAS3 performance. Overlapping communication and computation will
save time, but my experience indicates that the savings are limited.

My recommendation is weighted toward small matrices that are modestly spectrally
diagonally dominant, but not so dominant that certain matrix entries can be completely
ignored. If the input matrix is sparse and so strongly spectrally diagonally dominant
that the matrix never fills in, one would have to consider threshold methods and methods
that don't update parts of the matrix that remain zero. On the other hand, if the matrix
is quite large, performance could be further improved by using a different data layout from
the one that I recommend.

There are many implementation options available to anyone writing a Jacobi code.
I will discuss many of these implementation options in the following sections. Section 7.3.3
explains the basic variants and data layout options. Section 7.3.4 explains the computation
requirements of each of the basic variants. Section 7.3.5 explains the communication
requirements of each of the basic variants. Section 7.3.6 discusses blocking (both
communication and computation). Section 7.3.7 discusses the importance of exploiting symmetry.
Section 7.3.8 explains that one-sided methods need not recompute diagonal blocks of $A^T A$.
Section 7.3.9 discusses options for the partial eigendecomposition required by a blocked
Jacobi method. Section 7.3.10 discusses threshold strategies. Section 7.3.12 discusses
pre-conditioners. Section 7.3.13 discusses overlapping communication and computation.

7.3.3  Jacobi  Methods 

The Matlab code for the classical, two-sided, Jacobi method shown in figure 7.3 differs from
textbook descriptions only in that the rotation is computed by calling parteig and the
off-diagonals are compared to the diagonals (in the threshold test) in an unusual manner.
Figure 7.6 gives inefficient Matlab code for parteig which calls Matlab's eig() routine and
sorts the eigenvalues to guarantee convergence. In a real implementation, parteig would
be one or two sweeps of two-sided Jacobi.
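For a 2 by 2 block, the work hidden inside parteig reduces to a single plane rotation. The following is a minimal Python sketch of that step, not the thesis code: the function names `jacobi_rotation` and `rotate_block` are hypothetical, and the formula used is the standard textbook choice of the smaller-magnitude tangent.

```python
import math

def jacobi_rotation(app, apq, aqq):
    """Return (c, s) so that R = [[c, s], [-s, c]] makes R' * A * R diagonal
    for the symmetric 2x2 block A = [[app, apq], [apq, aqq]]."""
    if apq == 0.0:
        return 1.0, 0.0                      # block is already diagonal
    tau = (aqq - app) / (2.0 * apq)
    # Smaller-magnitude root of t^2 + 2*tau*t - 1 = 0 (t is the tangent).
    t = math.copysign(1.0, tau) / (abs(tau) + math.hypot(1.0, tau))
    c = 1.0 / math.hypot(1.0, t)
    return c, t * c

def rotate_block(app, apq, aqq):
    """Apply the rotation from both sides; return the rotated (app, apq, aqq)."""
    c, s = jacobi_rotation(app, apq, aqq)
    new_pp = c * c * app - 2.0 * c * s * apq + s * s * aqq
    new_qq = s * s * app + 2.0 * c * s * apq + c * c * aqq
    new_pq = c * s * (app - aqq) + (c * c - s * s) * apq
    return new_pp, new_pq, new_qq

# The off-diagonal entry of the rotated block is zero up to rounding.
print(rotate_block(2.0, 1.0, 2.0))
```

In a full two-sided code the same (c, s) pair would also be applied to the remaining entries of rows and columns p and q, and to the eigenvector matrix.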

A two-sided blocked Jacobi Matlab code is given in figure 7.4. Because the code in
figure 7.3 uses parteig to compute the rotations and norm in the threshold test, the only
difference between the blocked and unblocked versions is the definition of I and J. parteig
is not typically a full eigendecomposition; more often it is a single sweep of Jacobi.

The one-sided Jacobi variants can operate on any matrix whose left singular vectors
are the same as, or related to, the eigenvectors of the input matrix. This allows many choices
for pre-conditioning the input matrix, several of which are discussed in section 7.3.12.

The one-sided Jacobi methods lose symmetry, but still require fewer flops than the
two-sided Jacobi methods because they do not have to update the eigenvectors separately7.
Furthermore, the one-sided Jacobi methods always access the matrix in one direction (by
column for Fortran). A typical one-sided Jacobi method is shown in figure 7.5.
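To make the column-only access pattern concrete, here is a small unblocked one-sided Jacobi sketch in Python. It is an illustration under stated assumptions rather than the method of figure 7.5: the function name `one_sided_jacobi` is hypothetical, the input is assumed symmetric and nonsingular with eigenvalues of distinct magnitude, and a plain cyclic ordering stands in for any parallel pairing strategy.

```python
import math

def one_sided_jacobi(a, max_sweeps=30, tol=1e-14):
    """Eigen-decompose a symmetric, nonsingular matrix `a` (list of rows).

    Columns of a working copy W are rotated in pairs until they are mutually
    orthogonal; each rotation touches only two columns of W.
    """
    n = len(a)
    w = [row[:] for row in a]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # 2x2 block of W'W for columns p and q (three dot products)
                gpp = sum(w[k][p] * w[k][p] for k in range(n))
                gqq = sum(w[k][q] * w[k][q] for k in range(n))
                gpq = sum(w[k][p] * w[k][q] for k in range(n))
                if abs(gpq) <= tol * math.sqrt(gpp * gqq):
                    continue                 # columns already orthogonal
                rotated = True
                tau = (gqq - gpp) / (2.0 * gpq)
                t = math.copysign(1.0, tau) / (abs(tau) + math.hypot(1.0, tau))
                c = 1.0 / math.hypot(1.0, t)
                s = t * c
                for k in range(n):           # W(:,[p,q]) = W(:,[p,q]) * R
                    wp, wq = w[k][p], w[k][q]
                    w[k][p] = c * wp - s * wq
                    w[k][q] = s * wp + c * wq
        if not rotated:
            break
    # Normalized columns of W are the eigenvectors; Rayleigh quotients with
    # the original matrix recover the eigenvalues, signs included.
    values, vectors = [], []
    for j in range(n):
        norm = math.sqrt(sum(w[k][j] ** 2 for k in range(n)))
        v = [w[k][j] / norm for k in range(n)]
        av = [sum(a[i][k] * v[k] for k in range(n)) for i in range(n)]
        vectors.append(v)
        values.append(sum(v[i] * av[i] for i in range(n)))
    return values, vectors
```

The three dot products per pairing are the price paid for never forming or updating a separate eigenvector matrix.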

Parallel  Jacobi  methods  require  two  forms  of  communication.  The  columns  and/or 
rows  of  the  matrix  must  be  exchanged  in  order  to  compute  the  rotations  and  the  rotations 
must  be  broadcast.  The  basic  communication  for  one-sided  Jacobi  is  shown  in  figure  7.7 
while  the  communication  pattern  for  two-sided  Jacobi  is  given  in  figure  7.8. 

7.3.4  Computation  costs 

The computation and communication cost for the Jacobi method which I recommend for
non-vector distributed memory computers with many nodes, a one-sided blocked Jacobi on
a two-dimensional ($p_r \times p_c$) processor grid, is shown in table 7.3. Definitions for all symbols
used here can be found in Appendix A.

7 They also avoid applying rotations from both sides, but this advantage is negated by the fact that they
Figure 7.3: Matlab code for two-sided cyclic Jacobi


function [Q,D] = jac2(A)
%
% Classical two-sided threshold Jacobi
%

thresh = 1e-15;
maxiter = 25;
n = size(A,2)

iter = 0
mods = 1
Q = eye(n);

% Sweep until no pairing needs a rotation or maxiter sweeps are done
while (iter < maxiter & mods > 0 )
  mods = 0;
  for I = 1:n
    for J = 1:I-1
      blkA = A([J,I],[J,I]);
      % Threshold test: off-diagonal part compared to the whole 2x2 block
      if ( norm(blkA-diag(diag(blkA))) > ( norm(blkA)*thresh) )
        mods = mods + 1;
        [R,D] = parteig(A([J,I],[J,I]));
        A([J,I],:) = R' * A([J,I],:);      % rotate rows
        A(:,[J,I]) = A(:,[J,I]) * R;       % rotate columns
        Q(:,[J,I]) = Q(:,[J,I]) * R;       % accumulate eigenvectors
      end % if
    end % for J
  end % for I
  iter = iter + 1
end % while

D = diag(diag(A))
Figure 7.4: Matlab code for two-sided blocked Jacobi


function [Q,D] = bjac2( A )
%
% Two sided blocked threshold Jacobi
%

maxiter = 25 ;
thresh = 1e-15;
nb = 1 ;
n = size(A,2)

iter = 0;
mods = 1 ;
Q = eye(n) ;

while (iter < maxiter & mods > 0 )
  A = (A + A')/2;                        % restore symmetry
  mods = 0;
  for i = 1:nb:n
    maxi = min(i+nb-1,n);
    I = i:maxi;                          % indices of block I
    for j = 1:nb:i-1
      maxj = min(j+nb-1,n);
      J = j:maxj;                        % indices of block J
      blkA = A([J,I],[J,I]) ;
      if ( norm(blkA-diag(diag(blkA))) > ( norm(blkA)*sqrt(nb)*thresh) )
        mods = mods + 1 ;
        [R,D] = parteig(A([J,I],[J,I])) ;
        A([J,I],:) = R' * A([J,I],:) ;   % rotate block rows
        A(:,[J,I]) = A(:,[J,I]) * R ;    % rotate block columns
        Q(:,[J,I]) = Q(:,[J,I]) * R ;    % accumulate eigenvectors
      end % if
    end % for j
  end % for i
  iter = iter + 1
end % while

D = diag(diag(A)) ;
Figure 7.5: Matlab code for one-sided blocked Jacobi


function [Q, D] = bjac1( A )
%
% One sided blocked Jacobi
%

thresh = 1e-15 ;
nb = 2 ;
maxiter = 25;
n = size(A,2)

B = A;                                   % keep the input for the final Rayleigh quotients
iter = 0 ;
mods = 1 ;

while (iter < maxiter & mods > 0)
  mods = 0 ;
  for i = 1:nb:n
    maxi = min(i+nb-1,n);
    I = i:maxi;
    for j = 1:nb:i-1
      maxj = min(j+nb-1,n);
      J = j:maxj;
      blkA = A(:,[J,I])' * A(:,[J,I]) ;  % diagonal block of A'*A
      if (norm(blkA-diag(diag(blkA))) > norm(blkA)*sqrt(nb)*thresh)
        mods = mods + 1 ;
        [R,D] = parteig(blkA) ;
        A(:,[J,I]) = A(:,[J,I]) * R ;    % rotate columns only
      end % if
    end % for j
  end % for i
  iter = iter + 1
end % while

D = A' * A;
Q = A * diag(1 ./ sqrt(diag(D))) ;       % normalize the columns

D = Q' * B * Q ;
D = diag(diag(D)) ;
Figure 7.6: Matlab code for an inefficient partial eigendecomposition routine

%
% parteig - eigendecomposition with eigenvalues sorted
%
function [ Q, D ] = parteig( A )

[QQ, DD] = eig(A) ;
[tmp, Index] = sort(-diag(DD));          % descending eigenvalue order

D = DD(Index, Index) ;
Q = QQ(:, Index) ;


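Table 7.3: Performance model for my recommended Jacobi method

Each task lists three costs: (1) cost per parallel pairing; (2) cost per sweep, i.e.
$(n/nb)^2/(2p_c)$ parallel pairings; (3) cost for the recommended data layout
($nb = n/(2p_c)$, $p_c = 16p_r = 4\sqrt{p}$). [illegible] marks a term that could not be read.

Move column for this pairing (a)
  (1) [illegible]
  (2) $(n/nb)\,\alpha$ + [illegible] $\beta$ term
  (3) $8\sqrt{p}\,\alpha$ + [illegible] $\beta$ term

diag = A([I,J],:)' x A([I,J],:) (b)
  (1) $\delta_3$ + [illegible] $\gamma_3$ term
  (2) $(n/nb)^2\frac{1}{2p_c}\delta_3 + \frac{n^3}{p}\gamma_3$
  (3) $8\sqrt{p}\,\delta_3$ + [illegible] $\gamma_3$ term

Sum diag within each processor column
  (1) [illegible]
  (2) $(n/nb)\lg(p_r)\,\alpha$ + [illegible] $\beta$ term
  (3) $4\sqrt{p}\,(\lg(p)-4)\,\alpha$ + [illegible] $\beta$ term

[Q,D] = parteig(diag) (c)
  (1) $2nb^2(2\gamma_{\div}+\gamma_{\surd}) + 6(2nb)^3\gamma_1$
  (2) $2\frac{n^2}{p_c}\gamma_{\div} + \frac{n^2}{p_c}\gamma_{\surd} + 24\frac{n^2\,nb}{p_c}\gamma_1$
  (3) $\frac{1}{2}\frac{n^2}{\sqrt{p}}\gamma_{\div} + \frac{1}{4}\frac{n^2}{\sqrt{p}}\gamma_{\surd} + \frac{3}{4}\frac{n^3}{p}\gamma_1$ (see note d)

Broadcast Q within each processor column
  (1) [illegible]
  (2) $(n/nb)\lg(p_r)\,\alpha$ + [illegible] $\beta$ term
  (3) $4\sqrt{p}\,(\lg(p)-4)\,\alpha$ + [illegible] $\beta$ term

[task label illegible]
  (1) $\delta_3 + 2\frac{n}{p_r}(2nb)^2\gamma_3$
  (2) $(n/nb)^2\frac{1}{2p_c}\delta_3 + 4\frac{n^3}{p}\gamma_3$
  (3) [illegible]

Total
  (2) $(n/nb)\,\alpha + 2(n/nb)\lg(p_r)\,\alpha$ + [illegible] $\beta$ terms
      $+ 2\frac{n^2}{p_c}\gamma_{\div} + \frac{n^2}{p_c}\gamma_{\surd} + 24\frac{n^2\,nb}{p_c}\gamma_1 + (n/nb)^2\frac{1}{p_c}\delta_3 + 5\frac{n^3}{p}\gamma_3$
  (3) $8\sqrt{p}\,(\lg(p)-3)\,\alpha$ + [illegible] $\beta$ terms
      $+ \frac{1}{2}\frac{n^2}{\sqrt{p}}\gamma_{\div} + \frac{1}{4}\frac{n^2}{\sqrt{p}}\gamma_{\surd} + \frac{3}{4}\frac{n^3}{p}\gamma_1 + 8\sqrt{p}\,\delta_3 + 5\frac{n^3}{p}\gamma_3$

a My models assume that sends and receives do not overlap, hence the factor of 2. The factor of $(2\,nb\,p_c/n)$
represents the number of parallel pairings that can be performed on the data local to one processor column.

b Only A(I,:)' x A(J,:) need be computed. See section 7.3.8.

c Partial eigendecomposition of the (2nb) x (2nb) matrix performed with one pass of an unblocked two-sided
Jacobi method exploiting symmetry; see the column labeled "exploiting symmetry" in table 7.6.

d $\frac{(n/nb)^2}{2p_c} \times 6(2nb)^3\gamma_1 = 24\frac{n^2\,nb}{p_c}\gamma_1 = \frac{24}{32}\frac{n^3}{p}\gamma_1 = \frac{3}{4}\frac{n^3}{p}\gamma_1$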
Figure 7.7: Pseudo code for one-sided parallel Jacobi with a 2D data layout with
communication highlighted

Until convergence do:
  Foreach pairing do:
    Move column data (A) to adjacent columns of processors
    Compute A^T A locally (i.e. blkA = A(:,[I,J])' * A(:,[I,J]))
    Combine A^T A within each column of processors
    Partial eigendecomposition of diagonal block (i.e. [R,D] = eig(A^T A))
    Broadcast R within each row of processors
    Compute A R locally
  End Foreach
End Until


Table 7.4 shows the estimated execution time for one sweep of my recommended
Jacobi on a matrix of size 1000 by 1000 on a 64-node PARAGON. As this model has not
been validated, these estimates must be viewed with caution. Actual performance will be
different, but the model gives some idea of how important the various aspects may be.
This model is given in Matlab form in section B.2.1. Table 7.4 suggests that Jacobi is
indeed efficient (1.68/2.69 = 62%) even on such small problems. It also suggests that the
optimal data layout may be even taller and thinner than my recommended data layout:
pc = 32, pr = 2. A taller and thinner layout (specifically pc = 64, pr = 1) would double the
cost of message transmission between columns but would decrease the cost of the partial
eigensolver. The cost of the divides and square roots in the partial eigensolver would
decrease by a factor of 64/32 because all 64 processors would participate in the partial
eigensolver. And the cost of accumulating the rotations within the partial eigensolver would
decrease by 2 x 2 = 4. The first factor of 2 stems from the fact that all processors would share
in the work, while the second factor of 2 stems from the fact that the block size would be
smaller by a factor of 2 and the cost of accumulating rotations grows as $O(n^2\,nb)$.
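Several entries of table 7.4 can be reproduced directly from the model terms and machine constants. The Python sketch below evaluates only the terms whose formulas are legible; it is not the Matlab model of section B.2.1. The function name `sweep_model` and the parameter names are hypothetical, and reading the two "computing rotations" constants as the costs of a divide and a square root is an assumption.

```python
import math

def sweep_model(n, p, alpha, gamma_div, gamma_sqrt, delta3, gamma3):
    """Per-sweep time in seconds for the legible model terms of table 7.4."""
    root_p = math.sqrt(p)
    return {
        # 8*sqrt(p)*(lg(p) - 3) messages, each costing latency alpha
        "message latency": 8 * root_p * (math.log2(p) - 3) * alpha,
        # divides and square roots in the partial eigensolver (assumed labels)
        "rotations: divides": 0.5 * n**2 / root_p * gamma_div,
        "rotations: square roots": 0.25 * n**2 / root_p * gamma_sqrt,
        # software overhead of the matrix-matrix multiply calls
        "software overhead": 8 * root_p * delta3,
        # matrix-matrix multiply flops
        "matrix-matrix multiply": 5 * n**3 / p * gamma3,
    }

paragon = sweep_model(n=1000, p=64, alpha=65.9e-6, gamma_div=3.85e-6,
                      gamma_sqrt=7.7e-6, delta3=103e-6, gamma3=0.0215e-6)
for task, seconds in paragon.items():
    print(f"{task:26s} {seconds:5.2f}")
```

Rounded to two decimals, these five terms agree with the corresponding table 7.4 entries (0.01, 0.24, 0.24, 0.01 and 1.68 seconds).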

Table  7.5  gives  computation  cost  models  for  6  one-sided  Jacobi  variants.  These  models  are 
not  complete  (they  overlook  many  overhead  and  load  imbalance  costs),  nor  have  they  been 
validated.  This  table  is  designed  mainly  to  put  the  various  variants  in  perspective  and  not 


must perform dot products to form the square submatrices to be diagonalized.
Table 7.4: Estimated execution time per sweep for my recommended Jacobi on the PARAGON
on n=1000, p=64

Task | Performance model | Operation cost (a) | Estimated time (seconds)
Message latency | $8\sqrt{p}\,(\lg(p)-3)\,\alpha$ | $\alpha$ = 65.9e-6 | 0.01
Message transmission between columns | [illegible] | $\beta$ = .146e-6 | 0.06
Message transmission within columns | [illegible] | $\beta$ = .146e-6 | 0.01
Computing rotations | $\frac{1}{2}\frac{n^2}{\sqrt{p}}\gamma_{\div}$ | $\gamma_{\div}$ = 3.85e-6 | 0.24
Computing rotations | $\frac{1}{4}\frac{n^2}{\sqrt{p}}\gamma_{\surd}$ | $\gamma_{\surd}$ = 7.7e-6 | 0.24
Accumulating rotations in partial eigensolver | $\frac{3}{4}\frac{n^3}{p}\gamma_1$ | $\gamma_1$ = .074e-6 | 0.43
Software overhead | $8\sqrt{p}\,\delta_3$ | $\delta_3$ = 103e-6 | 0.01
[task label illegible] | $5\frac{n^3}{p}\gamma_3$ | $\gamma_3$ = .0215e-6 | 1.68
Total (per sweep) | | | 2.68

a See 6.1
Figure 7.8: Pseudo code for two-sided parallel Jacobi with a 2D data layout, as described
by Schreiber[150], with communication highlighted

Until convergence do:
  Foreach pairing do:
    Move row and column data (A) to diagonally adjacent processors
    Compute partial eigendecomposition of diagonal block
    Broadcast R within each row of processors
    Broadcast R' within each column of processors
    Compute R A R' locally
    Compute Q R locally
  End Foreach
End Until


to establish which is best. Communication costs are considered in section 7.3.5.

I have attempted to list the variants that have been implemented as well as the
most promising suggestions. For each variant I have, where appropriate, followed my
recommendations for implementing a Jacobi code made in section 7.3.16.

Table  7.6  gives  performance  models  for  5  commonly  mentioned  two-sided  Jacobi  variants. 
Like  the  performance  models  for  one-sided  Jacobi  variants,  these  models  are  incomplete  and 
have  not  been  validated. 

7.3.5  Communication  costs 

Table  7.7  summarizes  the  communication  costs  for  parallel  Jacobi  methods.  I  assume  that 
the  communication  block  size  is  chosen  to  be  as  large  as  possible. 

A  performance  model  for  Jacobi  could  be  created  by  selecting  the  appropriate 
computation  costs  from  table  7.5  or  table  7.6  and  the  appropriate  communication  cost  from 
table  7.7.  Not  all  load  imbalance  and  overhead  costs  are  covered  in  either  of  these  tables, 
and  the  models  have  not  been  validated. 
Table 7.5: Performance models (flop counts) for one-sided Jacobi variants. Entries which
differ from the previous column are shaded.


a For parallel codes we assume that the blocksize is chosen to be as large as possible, i.e. nb = n/(2pc)
where pc is the number of processor columns. For a serial code pc = n/(2 x nb) can be arbitrarily chosen.

b This is the one-sided method used by Littlefield and Maschhoff[125].

c This is the method shown in figure 7.5.

d This is the method used by Arbenz and Oettli[10].

e Using fast Givens is often mentioned, but rarely implemented. Perhaps the benefit is not as good as this
model would suggest.

f This is the method shown in figure 7.4.

g This is the method used by Arbenz and Slapnicar[9].

h One sweep of Jacobi on a matrix of size 2nb by 2nb.

i I also assume that only one processor in each processor column is involved in each partial
eigendecomposition.
Table 7.6: Performance models (flop counts) for two-sided Jacobi variants

Columns. Unblocked: (U1) ignore symmetry (b), (U2) exploit symmetry, (U3) fast Givens (c).
Blocked (a): (B1) ignore symmetry (d), (B2) exploit symmetry.
[illegible] marks a term that could not be read.

parteig(A([I,J],[I,J])) (one sweep (e))
  U1, U2, U3: [coefficient illegible] $n^2(\gamma_{\div}+\gamma_{\surd})$
  B1, B2: $8n^2\delta_1 + n^2(\gamma_{\div}+\gamma_{\surd}) + 24n^2\,nb\,\gamma_1$

$QAQ^T$: rotate from both sides
  U1: $4n^2\delta_1 + 6n^3\gamma_1$
  U2: $2n^2\delta_1$ + [illegible]
  U3: $n^2\delta_1 + 2n^3\gamma_1$
  B1: $4p_c^2\delta_3 + 8n^3\gamma_3$
  B2: $4p_c^2\delta_3$ + [illegible]

Update eigenvectors
  U1: $2n^2\delta_1 + 3n^3\gamma_1$
  U2: $2n^2\delta_1 + 3n^3\gamma_1$
  U3: $n^2\delta_1$ + [illegible]
  B1: $2p_c^2\delta_3 + 4n^3\gamma_3$
  B2: $2p_c^2\delta_3 + 4n^3\gamma_3$

Total (per sweep)
  U1: [coefficient illegible] $n^2(\gamma_{\div}+\gamma_{\surd}) + 6n^2\delta_1 + 9n^3\gamma_1$
  U2: [coefficient illegible] $n^2(\gamma_{\div}+\gamma_{\surd}) + 4n^2\delta_1 + 6n^3\gamma_1$
  U3: [coefficient illegible] $n^2(\gamma_{\div}+\gamma_{\surd}) + 2n^2\delta_1 + 4n^3\gamma_1$
  B1: $8n^2\delta_1 + n^2(\gamma_{\div}+\gamma_{\surd}) + 24n^2\,nb\,\gamma_1$ + [illegible] $+ 12n^3\gamma_3$
  B2: $8n^2\delta_1 + n^2(\gamma_{\div}+\gamma_{\surd}) + 24n^2\,nb\,\gamma_1$ + [illegible] $+ 8n^3\gamma_3$

Assumed $nb = n/(2p_c)$, $p_c = 16p_r$ (f)
  All columns: [illegible]

a For parallel codes we assume that the blocksize is chosen to be as large as possible, i.e. nb = n/(2pc)
where pc is the number of processor columns. For a serial code pc = n/(2 x nb) can be arbitrarily chosen.

b This is the method used by Pourzandi and Tourancheau[142], by Schreiber[150] and the method described
in figure 7.3.

c Using fast Givens is often mentioned, but rarely implemented. Perhaps the benefit is not as good as this
model would suggest.

d This is the method shown in figure 7.4.

e One sweep of Jacobi on a matrix of size 2nb by 2nb.

f I also assume that only one processor in each processor column is involved in each partial
eigendecomposition.
Table 7.7: Communication cost for Jacobi methods (per sweep)

Columns. One-sided: (O1) 1-D data layout (a), (O2) 2-D data layout (b).
Two-sided: (T1) 1-D data layout (c), (T2) 2-D data layout (d), (T3) 2-D data layout (e).
[illegible] marks a term that could not be read.

Exchange column vectors
  O1: $4p\,\alpha + 2n^2\beta$
  O2: [illegible]
  T1: $4p\,\alpha + 2n^2\beta$
  T2: $6p_c\log(\sqrt{p})\,\alpha + \frac{3}{2}\frac{n^2\log(\sqrt{p})}{p_r}\beta$
  T3: $12\sqrt{p}\,\alpha$ + [illegible] $\beta$ term

Reduce $A^T A$
  O1: 0
  O2: $2p_c\lg(p_r)\,\alpha$ + [illegible] $\lg(p_r)\,\beta$
  T1: 0
  T2: 0
  T3: 0

Broadcast rotations (f)
  O1: 0
  O2: $2p_c\lg(p_r)\,\alpha$ + [illegible] $\lg(p_r)\,\beta$
  T1: $2p\lg(p)\,\alpha$ + [illegible] $\lg(p)\,\beta$
  T2: $4p_c\log(p_r)\,\alpha + 4p_c\log(p_c)\,\alpha$ + [illegible] $\log(p_c)\,\beta$ + [illegible] $\log(p_r)\,\beta$
  T3: $8\sqrt{p}\,\alpha$ + [illegible] $\beta$ term

a This is the method used by Arbenz and Slapnicar[9].

b This is the method used by Littlefield and Maschhoff[125].

c This is the method used by Pourzandi and Tourancheau[142].

d This is the 2D method most likely to be used today.

e This is the method used by Schreiber[150].

f In the unblocked methods we assume that communication is blocked even though the computation is
not. We also assume that each rotation is sent as a single floating point number. This is natural if you are
using fast Givens but requires extra divides and square roots if fast Givens are not used.


7.3.6  Blocking 

Classical Jacobi methods annihilate individual entries whereas blocked Jacobi
methods use a partial eigendecomposition on blocks. Cyclic Jacobi methods use fewer
flops, especially if fast Givens rotations are used. But almost all of the floating point
operations in blocked Jacobi methods are performed in matrix-matrix multiply operations, the
most efficient operation.

Both  cyclic  and  blocked  Jacobi  methods  can  be  blocked  for  communication.  The 
communication  block  size  need  only  be  an  integer  multiple  of  the  computation  block  size. 
Blocking  for  communication  may  be  more  important  than  blocking  for  computation  because 
it  reduces  the  number  of  messages  by  a  factor  equal  to  the  communication  block  size. 

Blocking  allows  greater  possibilities  for  the  partial  eigendecomposition.  A  better 
partial  eigendecomposition  will  lead  to  faster  convergence.  For  example,  performing  two 
Jacobi  sweeps  in  the  partial  eigendecomposition  would  result  in  fewer  sweeps  through  the 
entire  matrix.  However  initial  experiments  indicate  that  on  random  matrices  the  best  that 
one  can  hope  for  is  a  reduction  of  lg(nb)  in  the  number  of  full  sweeps  even  if  one  uses  a 
complete  eigendecomposition  as  the  “partial  eigendecomposition”. 
Using a block size that is smaller than the maximum allowed (i.e. nb < n/(2pc))
offers various possibilities. It allows communication to be pipelined to some extent.
Alternatively it allows more than pc processors to be involved in computing the partial
eigendecompositions.

The per sweep cost of the partial eigensolutions grows as the square of the block
size because larger block sizes mean that fewer processors are involved in the partial
eigendecomposition8.

I recommend keeping the code simple by keeping the communication and computation
block size equal and setting nb = n/(2pc) so that each parallel pairing involves one
partial eigendecomposition per processor column. Using a rectangular process grid such that
(16 pr < pc < 32 pr) requires a lower nb and hence allows the code to keep communication
and computation block size equal while holding the cost of the partial eigendecomposition
to $\frac{3}{8}\frac{n^3}{p}\gamma_1$ to $\frac{3}{4}\frac{n^3}{p}\gamma_1$. On most machines this will be no more than half the
$\gamma_3$ (matrix-matrix multiply) cost, in part
because the partial eigendecomposition will fit in the highest level data cache.
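The grid recommendation can be turned into a small search over factorizations of p. The Python sketch below is illustrative only: the function name `candidate_layouts` is hypothetical, and the bounds are treated as inclusive (16 pr <= pc <= 32 pr) so that the recommended pc = 16 pr layout itself qualifies, which is an assumption about the intended range.

```python
def candidate_layouts(n, p, low=16, high=32):
    """List (pr, pc, nb) with pr * pc == p, low*pr <= pc <= high*pr,
    and the largest allowed block size nb = n / (2 * pc)."""
    layouts = []
    for pr in range(1, p + 1):
        if p % pr:
            continue                     # pr must divide p
        pc = p // pr
        if low * pr <= pc <= high * pr:
            layouts.append((pr, pc, n / (2 * pc)))
    return layouts

# 64 processors, n = 1000: a 2 x 32 grid with nb = 15.625
print(candidate_layouts(1000, 64))
```

For p = 64 the only grid in range is pr = 2, pc = 32, so nb is about 16 for n = 1000. In practice nb would be rounded to an integer.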

A larger computational block size increases the cost of partial eigendecomposition
and decreases the cost of the BLAS3 operations. Larger communication block size decreases
message latency cost but leaves less opportunity for overlapping communication with
computation. A larger ratio of pc to pr increases message latency but reduces the partial
eigendecomposition cost9. See section 7.3.9 for details on the partial eigendecomposition
cost.

7.3.7  Symmetry 

Exploiting symmetry in two-sided Jacobi methods is important because it reduces
the number of flops per sweep from $12n^3$ to $8n^3$. However, exploiting symmetry while
maintaining load balance is difficult. If in a blocked Jacobi method, the block size were set
to the largest value possible, i.e. $n/(2p_c)$, and a standard rectangular grid of processors were
used, half of the processors (either those above or below the diagonal) would be idle all
the time. Using a smaller block size would allow better load balance but gives up some
of the benefits of blocking. Alternatives, such as using a different processor layout for the
eigenvector update, are feasible, but their complexity makes them unattractive.

8This  does  not  hold  for  nb  <  n/(2p). 

9Assuming  only  one  processor  per  processor  column  is  involved  in  computing  partial  eigendecompositions. 
In one-sided Jacobi methods, $A^T A$ is symmetric and only half of it need be computed.
In fact, only a quarter of it must be computed as shown in the following section.

7.3.8  Storing  diagonal  blocks  in  one-sided  Jacobi 

One-sided Jacobi methods must compute diagonal blocks of $A^T A$. This is shown
in the Matlab code given in figure 7.5 as: blkA = A(:,[J,I])' * A(:,[J,I]). This is inefficient
because not only does it compute both halves of a symmetric (or Hermitian) matrix, but
A(:,I)'*A(:,I) and A(:,J)'*A(:,J) are already known. They are the diagonal blocks returned
by parteig on the most recent previous pairing which involved I and J respectively. Storing
these blocks for future use avoids the need to recompute them, although they may need to
be refreshed from time to time for accuracy reasons.
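The saving can be seen by assembling the pairing's block from cached diagonal blocks plus a single cross product. The Python sketch below is a toy illustration with hypothetical helper names (`gram`, `columns`, `assemble_block`). For brevity the "cached" blocks are computed directly here, whereas a real code would carry them over from the D returned by the previous parteig calls.

```python
def gram(x, y):
    """Return X' * Y for matrices stored as lists of rows."""
    return [[sum(x[k][i] * y[k][j] for k in range(len(x)))
             for j in range(len(y[0]))]
            for i in range(len(x[0]))]

def columns(a, idx):
    """Extract the columns listed in idx, i.e. A(:, idx)."""
    return [[row[j] for j in idx] for row in a]

def assemble_block(gjj, gii, a, J, I):
    """Build A(:,[J,I])' * A(:,[J,I]) from cached diagonal blocks gjj and gii.
    Only the cross block A(:,J)' * A(:,I) is computed; its transpose fills
    the other off-diagonal quarter."""
    gji = gram(columns(a, J), columns(a, I))
    top = [gjj[r] + gji[r] for r in range(len(J))]
    bottom = [[gji[r][c] for r in range(len(J))] + gii[c]
              for c in range(len(I))]
    return top + bottom
```

Only one quarter of the block, the cross product, requires new dot products in each pairing.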

7.3.9  Partial  Eigensolver 

My performance models suggest that execution time is likely to be minimized
when the partial eigendecomposition consists of either one or two sweeps of Jacobi. The
per sweep cost of the partial eigensolver grows as $O(n^2\,nb/p_c)$. In my recommended Jacobi
method, the partial eigensolver consists of one sweep of Jacobi and, based on the data
layout which I recommend, costs $\frac{3}{4}\frac{n^3}{p}\gamma_1 + O(\frac{n^2}{\sqrt{p}})$, or roughly 10% to 30% of the total
cost of the sweep. Preliminary experiments indicate that with a block size of 32, using a
full eigendecomposition instead of a partial eigendecomposition may reduce the number of
sweeps by as much as 20%. Assuming that a full eigendecomposition of a 32 by 32 matrix
costs 6 times what a single sweep of Jacobi would cost, this analysis suggests that the added
cost of a full eigendecomposition will not reduce the number of sweeps sufficiently to result
in a net decrease in execution time, especially if DGEMM performs efficiently on a smaller
block size10. On the other hand, since most of the advantage of a full eigendecomposition
will come from the second sweep, using two sweeps of Jacobi in the partial eigensolver
may result in a net decrease in execution time. This analysis depends on a great many
assumptions and should be taken as a guide, not a prediction. Schreiber[150] reached a
similar conclusion.
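The trade-off in this paragraph is simple arithmetic and can be checked with a few lines of Python. This is a back-of-the-envelope sketch under the stated assumptions only (a full eigendecomposition costs 6 times one partial sweep and removes 20% of the outer sweeps); the function name `relative_time` is hypothetical.

```python
def relative_time(partial_fraction, cost_ratio=6.0, sweep_reduction=0.20):
    """Total time with a full eigendecomposition, relative to one partial sweep.

    partial_fraction: share of a sweep's cost spent in the partial eigensolver.
    """
    # The partial eigensolver share grows by cost_ratio; the rest is unchanged.
    per_sweep = (1.0 - partial_fraction) + cost_ratio * partial_fraction
    # Fewer outer sweeps are needed.
    return per_sweep * (1.0 - sweep_reduction)

for share in (0.10, 0.30):
    print(share, relative_time(share))
```

With the partial eigensolver at 10% of the sweep cost the total is about 1.2 times the baseline, and at 30% it is about 2 times, so under these assumptions a full eigendecomposition does not pay off. The break-even share is 5%.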

In a non-blocked code, the "partial eigendecomposition" should consist of a rotation,
i.e. a full eigendecomposition. In a non-blocked code, the cost of the partial eigensolver,
though still $O(n^2\,nb)$, is lower because nb = 1 and for a 2 by 2 matrix, a single sweep of
Jacobi is a full eigendecomposition. Except for very small n, say n < 100, partial
eigendecompositions, such as those suggested by Gotze[85], are not likely to result in lower total
execution time.

10 A smaller block size reduces the cost of the partial eigensolver.

In a blocked eigensolver, one must compute a partial eigendecomposition for each
pairing. Most commonly, a single sweep of two-sided Jacobi is used as the partial
eigendecomposition. Since the elements in A(:,I)' x A(:,I) and A(:,J)' x A(:,J) are involved in
more pairings than the elements in A(:,I)' x A(:,J), they need not be annihilated in every
pairing.


The number of partial eigenproblems that can be performed simultaneously is
n/(2nb). If this is less than p, either the partial eigenproblems must themselves be performed
in parallel or some processors will be idle. Unless nb is quite large, say nb > 64, it is
likely to be faster to compute them each on a single processor, especially since the partial
eigendecomposition is a two-sided, not one-sided, sweep.

If n/(2nb) = pc, it is natural to assign one processor within each processor column
to perform the partial eigendecomposition. If n/(2nb) > pc, each parallel pairing will have
more partial eigenproblems than processor columns, hence the code could assign different
partial eigenproblems to different processors within each processor column. The other
alternative is to increase pc (decreasing pr). Hence, assigning different partial eigenproblems to
different processors within a column only makes sense if bandwidth cost makes increasing pc
unattractive. On the other hand, the only disadvantage to assigning different partial
eigenproblems to different processors within a column (as opposed to increasing pc) is increased
code complexity.

If the cost of divisions and square roots ($\frac{1}{4}n^2\,\frac{2\gamma_{\div}+\gamma_{\surd}}{\sqrt{p}}$) is significant, one should
consider inexact rotations in the partial eigensolver. Gotze points out that one need not
perform exact rotations and suggests a number of approximate rotations which avoid divides
and square roots[85]. It would be counterproductive to use inexact rotations (saving $O(n^2)$
flops at the expense of increasing the number of sweeps and the accompanying $O(n^3)$ flops)
in a parallel cyclic Jacobi method. Likewise I would be hesitant to use inexact rotations in
the partial eigensolver unless doing so makes it feasible to perform two sweeps in the partial
eigensolver. However, it is entirely possible that more sweeps with inexact rotations might
be better than fewer sweeps using exact rotations in the partial eigensolver.

Using a classical threshold scheme in the partial eigensolver is likely to save little
time, but using thresholds to perform more important rotations might improve performance.
A classical threshold scheme is not attractive because the processors performing fewer
rotations would simply sit idle. However, having each processor compute the same number of
rotations, while using thresholds to skip some rotations, might allow the rotations performed
to be more productive.

7.3.10  Threshold 

For serial cyclic codes, thresholds can significantly reduce the total number of
floating point operations performed, especially on spectrally diagonally dominant matrices.
Since Jacobi methods are most likely to be attractive on spectrally diagonally dominant
matrices, thresholds cannot be rejected as unimportant. However, in a blocked parallel
program, entire blocks can only be skipped if the whole block requires no rotations. As an
example, consider a blocked parallel Jacobi eigensolution of a 1024 by 1024 matrix on a
1024 node computer using a block size of 16. This would involve 63 (or 64) steps each of
which would consist of 32 pairings performed in parallel. Each pairing involves a partial
eigendecomposition of a (2 x 16) by (2 x 16) matrix. If any of the off-diagonal elements in any
of the 32 pairings requires annihilation, no savings is achieved in that step. Hence, in the
worst case, if just 63 of the 499,001 off-diagonal elements (one per step) require annihilation,
the threshold algorithm realizes no benefit.
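The step and pairing counts in this example follow from scheduling 1024/16 = 64 block columns so that every pair of blocks meets exactly once per sweep. The Python sketch below generates such a schedule with the round-robin "circle method". This is a stand-in chosen for brevity, not the caterpillar track pairing recommended earlier, and the function name `round_robin_steps` is hypothetical.

```python
def round_robin_steps(m):
    """Schedule m (even) block columns into m - 1 parallel steps of m/2
    disjoint pairings, so that every pair of blocks meets exactly once."""
    ring = list(range(m))
    steps = []
    for _ in range(m - 1):
        # Pair the k-th block from the front with the k-th from the back.
        steps.append([(ring[k], ring[m - 1 - k]) for k in range(m // 2)])
        # Keep ring[0] fixed and rotate the remaining blocks by one place.
        ring = [ring[0]] + [ring[-1]] + ring[1:-1]
    return steps

n, nb = 1024, 16
steps = round_robin_steps(n // nb)
print(len(steps), "steps of", len(steps[0]), "pairings; partial size", 2 * nb)
```

A step can be skipped by a threshold test only if none of its 32 pairings needs a rotation, which is why a handful of poorly placed elements can defeat the threshold.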

Corbato[47] devised a method for implementing a classical Jacobi method in $O(n^3)$
time. His method involves keeping track of the largest off-diagonal element in each column.
The cost of maintaining this data structure would more than double the cost of each rotation
and may not lead to reduced execution time even in serial codes. However, Beresford
Parlett[137] pointed out to me that one need not keep track of the true largest element and
that each rotation must maintain the sum of the squares of the elements, hence allowing
the list of "largest" off-diagonal elements to be out-of-date would not seriously undermine the
advantage and would significantly reduce the overhead. This deserves further study.

Untested  Threshold  methods 

One could design a code that used variable block sizes and/or switched from a
one-sided-non-threshold Jacobi to a two-sided-threshold Jacobi. A code could even scan
the matrix, identify the elements that need to be eliminated and select pairings and block
sizes that would eliminate those elements as efficiently as possible. In our worst case example
given in the preceding paragraph, it might be that those 63 off-diagonal elements could be
annihilated in just two parallel steps, each requiring only a two element rotation.

Scanning all off-diagonal elements and choosing the largest n non-interfering
elements might be an attractive compromise between the classical Jacobi method, which
examines all off-diagonal elements and annihilates the largest, and the cyclic Jacobi method,
which annihilates all elements without regard to size. If software overhead could be kept
modest, such a method might be attractive on small spectrally diagonally dominant matrices,
precisely the matrices that are best suited to Jacobi methods.
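
One way such a scan could be organized is sketched below. It is only an illustration: it assumes that "non-interfering" means that no two chosen elements share a row or column index, and it uses a simple greedy choice, which is not taken from the text. The function name is hypothetical.

```python
def largest_noninterfering(a, limit=None):
    """Greedily pick large off-diagonal entries of symmetric `a` with disjoint index pairs."""
    n = len(a)
    # All strictly upper triangular entries, largest magnitude first.
    candidates = sorted(
        ((abs(a[i][j]), i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    used, chosen = set(), []
    for _, i, j in candidates:
        if i in used or j in used:
            continue  # rotating (i, j) would interfere with an earlier choice
        chosen.append((i, j))
        used.update((i, j))
        if limit is not None and len(chosen) == limit:
            break
    return chosen


a = [
    [4.0, 0.1, 0.9, 0.2],
    [0.1, 3.0, 0.3, 0.8],
    [0.9, 0.3, 2.0, 0.05],
    [0.2, 0.8, 0.05, 1.0],
]
print(largest_noninterfering(a))
```

Under this reading of "non-interfering", at most n/2 elements can be chosen per parallel step, and the sort is the software overhead that would have to be kept modest.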

Jacobi methods that attempt to annihilate larger elements, i.e. threshold methods,
work best in two-sided Jacobi methods. This is unfortunate because it appears that one-sided
Jacobi is otherwise preferred.

As mentioned in section 7.3.9, thresholds might be useful in the partial
eigendecomposition.

7.3.11  Pairing 

The order in which the off-diagonal elements are annihilated is referred to as the
pairing strategy. Eliminating off-diagonal element $A_{ij}$ in a two-sided Jacobi requires that
rows i and j of A and columns i and j of A be rotated. Hence, rows i and j of A must be
distributed similarly, i.e. $A_{ik}$ and $A_{jk}$ must both reside on the same processor. Likewise,
columns i and j of A must be distributed similarly. Orthogonalizing vectors i and j in
a one-sided Jacobi also requires that the two vectors be distributed similarly. In order to
annihilate multiple off-diagonal elements simultaneously, they must reside on different sets
of processors.

The pairing strategy affects execution time through communication cost, the number
of pairings per sweep and the number of sweeps required for convergence. Different pairing
strategies require different communication patterns and hence incur different communication
costs. Some pairing strategies require slightly more pairings than others. Mantharam and
Eberlein argue that some pairings lead to faster convergence than others[72].

In this section, we illustrate two pairing strategies, showing how each would pair
8 elements in 4 sets at a time. The elements might be individual indices (in a non-blocked
Jacobi) or blocks of indices. The sets might correspond to individual processors (in a 1D
data layout) or columns of processors (in a one-sided Jacobi on a 2D layout) or rows and
columns of processors (in a two-sided Jacobi on a 2D layout). Furthermore, several sets
might be assigned to the same processor or column of processors.

The  classic  round  robin  pairing  strategy[84]  leaves  one  element  stationary  and 
rotates  the  other  elements.  As  the  following  diagram  shows,  in  7  pairings,  each  element  is 
paired  exactly  once  with  each  of  the  other  elements.  Elements  3  through  8  follow  elements 
2  through  7  respectively,  while  element  2  follows  element  8. 
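
A round robin schedule of this kind can be generated in a few lines. The sketch below is an illustration only: it keeps element 1 fixed and moves the others around a ring so that element 3 takes the place of element 2 and element 2 takes the place of element 8, as described above, but the way the two lines are folded into pairs is an assumption about the layout of the diagram. The function name is hypothetical.

```python
def round_robin(n):
    """Yield n-1 rounds of n/2 disjoint pairs for even n; element 1 stays fixed."""
    ring = list(range(2, n + 1))
    for _ in range(n - 1):
        line = [1] + ring
        # Fold the line in half: first with last, second with second-to-last, ...
        yield [(line[k], line[n - 1 - k]) for k in range(n // 2)]
        ring = ring[1:] + ring[:1]  # each element moves into its predecessor's slot


for step, pairs in enumerate(round_robin(8), start=1):
    print(step, pairs)
```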

A slight variation, called the caterpillar pairing method[72, 73, 150], cuts the
communication cost in half at the expense of increasing the number of pairings from n - 1
to n. The caterpillar method, modified so that communication is always performed in the
same direction, is shown below. Only the elements in the top line rotate, and they always
rotate to the left. The elements, shown in red, in the bottom line get swapped into the top
line one at a time. In this pairing method, it takes 8 pairings in order for each element to
be paired with every other element. The swapped elements need not perform any work,
but must exchange the blocks assigned to them prior to the next communication step. This
pairing strategy requires 16 (in general 2n) pairings to come back to the original pairing,
but the second n pairings duplicate the first n.

Mantharam and Eberlein[72] suggest that some pairing strategies may lead to
convergence in fewer steps than others.

7.3.12  Pre-conditioners 

One-sided Jacobi methods compute eigenvectors by orthogonalizing a matrix which
has the same or related left singular vectors as the original matrix. Some options include:

[U,D,V] = svd(A); U contains the eigenvectors of A, D is the absolute value of the
eigenvalues of A. This method is used by Berry and Sameh[21].

[U,D,V] = svd(chol(A)); U contains the eigenvectors of A, D is the square root of the
eigenvalues of A. This is used by Arbenz and Slapnicar[9] and is mathematically
equivalent to classical Jacobi.

[Q,R] = qr(A); [U,D,V] = svd(R); Q * U contains the eigenvectors of A. D contains the
absolute value of the eigenvalues of A.

In addition, there are pivoting counterparts to both Cholesky and QR, indeed many
flavors of QR with pivoting, which would improve these pre-conditioners. If A is spectrally
diagonally dominant, permuting A so that the diagonal elements are non-increasing might
provide most of the benefit that Cholesky with pivoting does and at considerably lower
cost.
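
Why each option yields the eigenvectors can be checked in a few lines. The derivation below is a sketch under stated assumptions: A is symmetric with eigenvalues collected in the diagonal matrix $\Lambda$, A is positive definite in the Cholesky case, and chol is taken to return the lower triangular factor L with $A = LL^T$ (with the upper triangular convention the roles of U and V are exchanged).

```latex
\begin{aligned}
A = U D V^T,\; A = A^T
  &\;\Longrightarrow\; A^2 = A A^T = U D^2 U^T, && D = |\Lambda|, \\
A = L L^T,\; L = U D V^T
  &\;\Longrightarrow\; A = U D^2 U^T,            && D = \Lambda^{1/2}, \\
A = Q R,\; R = U D V^T
  &\;\Longrightarrow\; A = (Q U)\, D\, V^T,      && D = |\Lambda|.
\end{aligned}
```

In the first and third cases the left singular vectors are eigenvectors of $A^2$, and hence of A whenever the eigenvalues of A have distinct absolute values.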

7.3.13  Communication  overlap 

Overlapping communication and computation is attractive because in theory it
reduces the total cost from the sum of the computation and communication costs to their
maximum. Arbenz and Slapnicar demonstrated that overlapping communication and
computation is straightforward in a one-sided Jacobi method with a one-dimensional data layout[10].
However, overlapping communication and computation when using a two-dimensional data
layout is not as straightforward. Furthermore, actual experience with communication and
computation overlap has been disappointing; see section B.1.6.

7.3.14  Recursive  Jacobi 

The partial eigendecomposition could be a recursive call to a Jacobi eigensolver. A
recursive Jacobi could offer all the benefits shown by Toledo on LU[165], notably excellent
use of the memory hierarchy. Unfortunately, each level of recursion requires 6 calls, tripling
the software overhead. Therefore, the number of subroutine calls, and hence the software
overhead, grows at an unacceptably high $O(n^{\lg 6})$.
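
The growth rate can be read off a simple recurrence. The sketch below assumes that each of the 6 calls works on a problem of half the size and that the recursion continues down to size 1; T(n) here is just a count of leaf calls, introduced for illustration.

```latex
T(n) = 6\,T(n/2), \quad T(1) = 1
\;\Longrightarrow\;
T(n) = 6^{\lg n} = n^{\lg 6} \approx n^{2.58}
```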

Increasing software overhead in order to reduce the number of sweeps will make
sense for large matrices but not for small matrices. Since Jacobi is unlikely to be faster than
tridiagonal based methods for large matrices, I feel that it is more important to concentrate
on making Jacobi fast on smaller matrices. Hence, I do not include recursion as a part of my
recommended Jacobi method. Nonetheless, it may be that one step of recursion (tripling
the software overhead) and conceivably two steps of recursion (increasing software overhead
by a factor of 9) could reduce total execution time, but I would not expect the improvement
to be significant.

7.3.15 Accuracy

Demmel and Veselic[58] prove that on scaled diagonally dominant matrices, Jacobi
can compute small eigenvalues with high relative accuracy while tridiagonal based methods
cannot. Drmac and Veselic[71] show that Jacobi methods can be used to refine an
eigensolution, thereby providing high relative accuracy on scaled diagonally dominant matrices
at lower total cost than a full Jacobi. Demmel et al.[56] give a comprehensive discussion of
the situations in which Jacobi is more accurate than other available algorithms.

7.3.16  Recommendation 

If I were asked to write one Jacobi method for all non-vector distributed memory
computers, it would be a one-sided blocked Jacobi method. It would use a one-dimensional
data layout on computers with fewer than 48 nodes and a two-dimensional data layout on
computers with 48 or more nodes. It would use 16-32 times as many processor columns as
rows in a two-dimensional data layout^11. It would use a computational and communication
block size equal to^12 $\max(n/(2p_c), 8)$, leaving processors idle if $n/(2p_c) < 8$. It would
compute the partial eigendecompositions on just one processor in each processor column.
It would avoid recomputing diagonal entries unnecessarily, use a one-directional caterpillar
track pairing and one sweep of Jacobi for the partial eigendecomposition. It would use the
largest block size possible for both computation and communication.
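
The block size rule is easy to evaluate for a given problem. The sketch below is illustrative only: it assumes that n is divisible by the block size and that each busy processor column holds one pair of blocks, and both function names are hypothetical.

```python
def jacobi_block_size(n, pc):
    """Largest block size giving two blocks per processor column, but never below 8."""
    return max(n // (2 * pc), 8)


def busy_processor_columns(n, pc):
    """Number of processor columns that receive a pair of blocks (assumed layout)."""
    nb = jacobi_block_size(n, pc)
    blocks = -(-n // nb)               # ceil(n / nb)
    return min(pc, -(-blocks // 2))    # one pair of blocks per busy column


print(jacobi_block_size(1024, 32), busy_processor_columns(1024, 32))
print(jacobi_block_size(256, 32), busy_processor_columns(256, 32))
```

In the second call the minimum block size of 8 takes effect, so only half of the 32 processor columns have work.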

If I had time to experiment, I would investigate different partial eigendecompositions,
pre-conditioners and pairing strategies, in that order. Overlapping communication
and computation appears to offer greater performance improvements in theory than in
practice. I would use thresholds as a part of the stopping criteria, but wouldn’t count on them
to avoid unnecessary flops. I would check to make sure that my suggested data layout (1D
for $p < 48$, $16p_r \le p_c \le 32p_r$ for $p \ge 48$ and nb = $\max(n/(2p_c), 8)$) was reasonable on
several computers, but unless there was a substantial benefit to tuning the data layout to
each machine I would hesitate to do so.

For vector machines I recommend an unblocked code with fast Givens rotations
if the cost of BLAS1 operations is no more than twice that of BLAS3 operations. If the
BLAS1 operations cost just twice what BLAS3 operations cost, the flop cost in an unblocked
code would be 6/5 that of the blocked code (because unblocked codes using fast Givens
require 3/5 as many flops). Savings on other aspects can be expected to make up for this
difference on all but the largest matrices. Communication should still be blocked, however.
A one-dimensional data layout can be used for more nodes if a cyclic code is used, perhaps
as many as a hundred nodes, since block size is not an issue. As long as $n < 2p$ a one-dimensional
data layout is limited only by communication costs.

^11 The ratio can be made to fall in the 16-32 range for any number of processors except 1 to 15, 32 to
63 and 128 to 144. No more than 2.1% of the processors are left idle following these rules.

^12 Definitions for all symbols used here can be found in Appendix A.

Combining elements of classical and cyclic Jacobi is an interesting long shot. Classical
Jacobi always annihilates the largest off-diagonal element but requires $O(n^4)$
comparisons per sweep^13. Annihilating the n largest off-diagonal elements each time would roughly
match the number of comparisons performed to the number of flops performed. To
parallelize this idea, one would have to choose the n largest non-interfering elements.

7.4  ISDA 

The total execution time for the ISDA[97] for solving the symmetric
eigenproblem^14 will be no less than $100n^3$ on typical matrices. The execution time depends largely
on how many decouplings are required to make each of the smaller matrices no larger than
half the size of the original matrix. It also depends on the cost of each decoupling, but this
will not vary that much.

The ISDA achieves high floating point execution rates, but in order to beat tridiagonal
methods it must achieve 100/(10/3) = 30 times higher floating point rates, which
it does not. The PRISM implementation of ISDA takes 36 minutes = 2160 seconds to
compute the eigendecomposition of a matrix of size 4800 by 4800 on the 100 node SP2 at
Argonne[29]. ScaLAPACK's PDSYEVX takes 397 seconds to compute the eigendecomposition
of a matrix of size 5000 by 5000 on a 64 node SP2[31]. ISDA should not require as large
a granularity, $n/\sqrt{p}$, as PDSYEVX because of its heavy reliance on matrix-matrix multiply.
However, at present, the PRISM implementation is still at least three times slower than
PDSYEVX even on small matrices. Solving a matrix of size 800 by 800 on 64 nodes takes 60
seconds using the PRISM ISDA code, whereas PDSYEVX can solve a matrix of size 1000 by
1000 on 64 nodes of an SP2 in 16 seconds.
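
Because the quoted timings use different matrix sizes and node counts, a rough normalization helps when comparing them. The sketch below uses node-seconds per $n^3$ as a crude common measure; that normalization, which assumes perfect scaling in both n and the number of nodes, is an assumption made here for illustration and is not part of the original analysis.

```python
def node_seconds_per_n3(seconds, nodes, n):
    """Crude normalized cost: (time x nodes) / n^3, assuming perfect scaling."""
    return seconds * nodes / n**3


# Timings quoted in the text.
prism_large = node_seconds_per_n3(2160, 100, 4800)
pdsyevx_large = node_seconds_per_n3(397, 64, 5000)
prism_small = node_seconds_per_n3(60, 64, 800)
pdsyevx_small = node_seconds_per_n3(16, 64, 1000)

# Flop-rate advantage ISDA would need in order to match tridiagonal methods.
required_rate_ratio = 100 / (10 / 3)

print(round(required_rate_ratio))
print(prism_large / pdsyevx_large, prism_small / pdsyevx_small)
```

Under this normalization both ratios come out well above three, in line with the statement that the PRISM code is at least three times slower.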

The cost of each decoupling depends upon how close the split happens to come to
an eigenvalue of the matrix being split. The number of beta function evaluations required
for a given decoupling is roughly $-\log(\min_i |\mathrm{split} - \lambda_i|)$, where split is the split point
selected for this decoupling. The distance between split and the nearest eigenvalue cannot be
computed in advance, but the number of evaluations is likely to fall in the range
$(\log(n)/\log(1.5) + 2,\ \log(n)/\log(1.5) + 8)$. This is consistent with empirical results. For our purposes we will say that the number
of beta function evaluations is $\log(1500)/\log(1.5) + 2 \approx 20$. The cost per beta function
evaluation is 2 matrix-matrix multiplies at $2(n')^3\gamma_3/p$ each, where $n'$ is the size of the
matrix being decoupled. Hence the cost for the first decoupling is $2 \times 2 \times 20\,n^3\gamma_3/p = 80n^3\gamma_3/p$.

^13 Or increased overhead if Corbato’s method[47] is used.

^14 See section 2.7.3 for a brief description of the ISDA.

If each decoupling splits the matrix exactly in half, round i of decouplings involves
$2^i$ decouplings, each involving a matrix of size $n/2^i$, at a total cost of $2^i \times 80(n/2^i)^3 = 80n^3/4^i$.
The sum of all rounds would then be $\sum_{i=0}^{\infty} 80n^3/4^i = 80 \times (4/3)\,n^3 \approx 107n^3$.
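
The geometric sum can be checked numerically. The sketch below accumulates the per-round cost in units of $n^3$, assuming exact halving at every decoupling as in the text; the function name is hypothetical.

```python
def isda_total_coefficient(rounds, first_decoupling=80.0):
    """Sum the per-round cost coefficients (in units of n^3) for even splits."""
    total = 0.0
    for i in range(rounds):
        decouplings = 2**i              # matrices handled in round i
        size_fraction = 1.0 / 2**i      # each has size n / 2^i
        total += decouplings * first_decoupling * size_fraction**3
    return total


print(isda_total_coefficient(1), isda_total_coefficient(12))
```

The first round alone contributes 80, and the total approaches 80 x 4/3 after only a few rounds.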

The ISDA for symmetric eigendecomposition may require substantially longer on
some matrices with a single cluster of eigenvalues containing more than half of the
eigenvalues and on matrices with most of the eigenvalues at one end of the spectrum^15. It is
unlikely that the first split point chosen for decoupling will lie in the middle of a cluster.
Hence, if the matrix contains one large cluster, that cluster will likely remain completely
in one of the two submatrices, making the decoupling less even and hence less successful.
Likewise, if most of the eigenvalues are at one end of the spectrum, the submatrix on that
end of the spectrum will likely be much larger than the other after the first decoupling. If
each decoupling splits off only 20% of the spectrum, the total time will be twice what it
would be if each decoupling splits the spectrum exactly in half.

One could check to make sure that a reasonable split point has been chosen by
performing an $LDL^T$ decomposition on the shifted matrix, and counting the number of
positive or negative values in D. An $LDL^T$ decomposition costs $\frac{1}{3}n^3$ flops, or about 0.5%
of the flops required to perform the full decoupling.
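
The check relies on Sylvester's law of inertia: the number of negative entries in D equals the number of eigenvalues below the shift. The sketch below is a minimal dense illustration using an unpivoted $LDL^T$ factorization, which is an assumption made for brevity; a production code would need to guard against small pivots. The function name is hypothetical.

```python
def count_eigenvalues_below(a, shift):
    """Count eigenvalues of symmetric `a` below `shift` via unpivoted LDL^T of a - shift*I."""
    n = len(a)
    m = [[a[i][j] - (shift if i == j else 0.0) for j in range(n)] for i in range(n)]
    negatives = 0
    for k in range(n):
        d = m[k][k]                        # k-th diagonal entry of D
        if d == 0.0:
            raise ZeroDivisionError("zero pivot; perturb the shift slightly")
        if d < 0.0:
            negatives += 1
        for i in range(k + 1, n):          # update the lower triangle of the Schur complement
            l = m[i][k] / d
            for j in range(k + 1, i + 1):
                m[i][j] -= l * m[j][k]
    return negatives


# Eigenvalues of this matrix are 2 - sqrt(2), 2 and 2 + sqrt(2).
t = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
print([count_eigenvalues_below(t, s) for s in (0.0, 1.5, 2.5, 4.0)])
```

A split point is reasonable when the count is not too close to 0 or n.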

7.5  Banded  ISDA 

Banded ISDA is very nearly a tridiagonal based method and hence offers
performance that is nearly as good as tridiagonal based methods. PRISM’s single processor
implementation of banded ISDA is two to three times slower than bisection (DSTEBZ)[26].
Computing eigenvectors using banded ISDA will not only be more difficult to code, it will
require about twice as many flops as inverse iteration. Banded ISDA requires additional
bandwidth reductions, each of which requires up to $2n^3$ additional flops during back
transformation^16.

^15 Fann et al.[75] present a couple of examples of real applications that fit this description.

Banded ISDA could make sense if reduction to banded form were twice as fast as
reduction to tridiagonal form, although even then one has to question whether it makes sense
to use banded ISDA instead of a banded solver.

Banded ISDA should perform a few shifted $LDL^T$ decompositions to make sure
that the selected shift will leave at least 1/3 of the matrix in each of the two submatrices.

7.6  FFT 

Yau and Lu[174] have implemented an FFT based invariant subspace decomposition
method. It, like ISDA, uses efficient matrix-matrix multiply flops, but since it requires
$100n^3$ flops the same analysis which shows that ISDA will not be faster applies to it as well.
Domas and Tisseur have implemented a parallel version of the Yau and Lu method[60].


^16 The first bandwidth reduction essentially always requires the full $2n^3$ flops during back transformation,
though later ones typically require less than that. However, taking advantage of the opportunity to perform
fewer flops either requires a complex data structure or requires that the update matrix Q be formed and then applied,
adding another $\frac{4}{3}n^3$ flops.

Chapter 8

Improving  the  ScaLAPACK 
symmetric  eigensolver 

8.1  The  next  ScaLAPACK  symmetric  eigensolver 

The next ScaLAPACK symmetric eigensolver will be 50% faster than the ScaLAPACK
symmetric eigensolver in version 1.5 and provide performance that is independent of the
user’s data layout. Separating internal and external data layout will not only make the code
easier to use because the user need not modify their storage scheme, it will also improve
performance. The next ScaLAPACK symmetric eigensolver will select the fastest of four
methods for reduction to tridiagonal form^1, and use Parlett and Dhillon’s new tridiagonal
eigensolver[139].

Separating internal and external data layout allows execution time to be reduced
for three reasons. It allows reduction to tridiagonal form and back transformation to use
different data layouts. It allows reduction to tridiagonal form to use a square processor grid,
significantly reducing message latency and software overhead. It allows the code to support
any input and output data layout without all the layers of software required to support
any data layout. Last but not least, by concentrating our coding efforts on the simple but
efficient square cyclic data layout, we can implement several reduction to tridiagonal codes
and incorporate ideas that would be prohibitively complicated in a code that had to support
multiple data layouts.

^1 On machines where timers are not available, a heuristic will be used which may not always pick the
fastest.

The rest of this section concentrates on improving execution time in reduction to
tridiagonal form. Back transformation is already very efficient and hence leaves less room
for improvement. We leave the tridiagonal eigensolver to others[139]. Figure 8.1 gives a
top-level description of the next ScaLAPACK symmetric eigensolver.


Figure 8.1: Data redistribution in the next ScaLAPACK symmetric eigensolver

8.2 Reduction to tridiagonal form in the next ScaLAPACK
symmetric eigensolver

Figure  8.2  shows  how  the  data  layout  for  reduction  to  tridiagonal  form  will  be 
chosen.  The  data  layout  and  the  code  used  for  reduction  to  tridiagonal  form  must  be 
chosen  in  tandem. 

Although  the  new  PDSYTRD  has  three  variants,  they  all  share  the  same  pattern  of 
communication  and  computation  shown  in  figure  8.3. 

Message initiations are reduced by using techniques first used in HJS, and several
new ones. HJS stores V and W in a row-distributed/column-replicated manner which avoids
the need to broadcast them repeatedly. HJS also keeps the number of messages small by
combining messages wherever possible.

Our communication pattern has three advantages over HJS: it requires fewer
messages, does not risk over/underflow and uses only the BLACS communication primitives^2.
The manner in which we compute the Householder vector requires the same number of
message initiations as HJS, but avoids the risk of over/underflow in the computation of
the norm. We use fewer messages than HJS because we update w in a novel manner (see the
discussion of Line 4.1 below) and we delay the spread of w (which HJS naturally performs
at the bottom of the loop) to the top of the loop so that it can be spread in the same
message that spreads v.

^2 Whether the right communication primitives were chosen for the BLACS may be debatable, but they are
what is available for use within ScaLAPACK.

Figure 8.2: Choosing the data layout for reduction to tridiagonal form

Our communication pattern has one disadvantage compared with HJS: it requires redundant
computation in the update of w. The discussion of Line 4.1 below explains that we can
choose to eliminate this redundant computation by increasing the number of messages.

Line 2.1 in Figure 8.3: In Section 8.4.1 we show how to avoid overflow while using just
$2n\log(\sqrt{p})$ messages.

Lines 3.2 and 3.6 in Figure 8.3: Only 2 messages are required to transpose a matrix
when a square processor layout is used. Each processor $(a, b)$ must send a message
to, and receive a message from, its transpose processor $(b, a)$. The required time is:

$$\sum_{n'=1}^{n} 2(\alpha + 2n'\beta) \approx 2n\alpha + 2n^2\beta$$

Line 4.1 in Figure 8.3: $w = w - WV^Tv - VW^Tv$ can be computed in a number of
ways. W, V and v are distributed across processor rows and replicated across processor
columns. $W^T$, $V^T$ and $v^T$ are distributed across processor columns and replicated
across processor rows. Furthermore, since only the partial sums contributing to w are
known, the updates to w can be made on any processor column, and even spread across
various processor columns. Appendix B.1 shows how this update is performed without
communication and shows that there is a range of options which trade off communication
and load imbalance.

Figure 8.3: Execution time model for the new PDSYTRD. Line numbers match
Figure 4.5 (PDSYTRD execution time) where possible.


Line 1.1 in Figure 8.3 updates the current block column. This can be implemented in
several ways. LAPACK’s DSYTRD uses a right looking update^3 because a matrix-matrix
multiply is more efficient than an outer product update. HJS uses a left looking update
because on their cyclic data layout, the left looking update allows all processors to be
involved, reducing load imbalance.

Line 5.1 in Figure 8.3: Computing $c = w v^T$ requires summing c within a processor
column. In order to compute w in Line 5.1, c must be known throughout a processor
column. To allow w and v to be broadcast in the same message (Line 3.1), c is summed
and broadcast in the column that owns column $i + 1$ of the matrix.

Line 6.1 in Figure 8.3: No communication is required here. W, $V^T$ and $W^T$ are already
replicated as necessary.

8.3  Making  the  ScaLAPACK  symmetric  eigensolver  easier  to 
use 

The  next  ScaLAPACK  symmetric  eigensolver  will  separate  internal  data  layout  from 
external  data  layout  while  executing  50%  faster  than  PDSYEVX  on  a  large  range  of  problem 
sizes  on  most  distributed  memory  parallel  computers  and  requiring  less  memory.  Separating 
internal  and  external  data  layout  allows  the  user  to  choose  whatever  data  layout  is  most 
appropriate  for  the  rest  of  their  code  and  to  use  that  data  layout  regardless  of  the  problem 
size  and  computer  they  are  using.  Separating  internal  and  external  data  layouts  also  makes 
it  easy  for  the  ScaLAPACK  symmetric  eigensolver  to  add  support  for  additional  data  layouts. 
However, while these ease-of-use issues are the most important advantages of separating
internal and external data layout, we will focus further discussion on how this separation improves
performance.

8.4  Details  in  reducing  the  execution  time  of  the  ScaLAPACK 
symmetric  eigensolver 

Separating internal and external data layout will improve the performance of
PDSYEVX by allowing PDSYEVX to use different data layouts for different tasks, and by allowing
PDSYEVX to concentrate only on the most efficient data layout for each task. A reduction
to tridiagonal form which only works on a cyclic data layout on a square processor grid will
not only have lower overhead and load imbalance than the present reduction to tridiagonal
form, but will be able to incorporate techniques that would be prohibitively complicated if
they were implemented in a code that must support all data layouts.

^3 A right looking update updates the current column with a matrix-matrix multiply. A left looking update
updates every column in the block column with an outer product update.

Significant reduction of the execution time in PDSYEVX, the ScaLAPACK symmetric
eigensolver, requires that all four sources of inefficiency (message latency, message
transmission, software overhead and load imbalance) be reduced. Fortunately, as Hendrickson,
Jessup and Smith[91] have shown, all of these can be reduced. PDSYEVX sends 3 times as
many messages as necessary^4, and requires 3 times as much message volume as well^5.
Overhead and load imbalance costs are harder to quantify. Load imbalance costs will be reduced
by using data layouts appropriate to each task^6. If necessary, load imbalance costs can be
further reduced at the expense of increasing the number of messages sent. Overhead will
be reduced by eliminating the PBLAS, reducing the number of calls to the BLAS and, where
a sufficiently good compiler is available, eliminating the calls to the BLAS entirely.

8.4.1 Avoiding overflow and underflow during computation of the Householder
vector without added messages

Overflow and underflow can be avoided during the computation of the Householder
vector without added messages by using the pdnrm2 routine to broadcast values. The
easiest way to compute the norm of a vector in parallel is to sum the squares of the elements.
However, this will lead to overflow if the square of one of the elements or one of the
intermediate values is greater than the overflow threshold (likewise, underflow occurs if one or
more of the squares of the elements or the intermediate values is less than the underflow
threshold). The ScaLAPACK routine pdnrm2 avoids underflow and overflow during
reductions by computing the norm directly, leaving the result on all processors in the processor
column. This requires $2\lg(p_r)\alpha$ execution time. In PDSYTRD, $\alpha = A(i+1, i)$ is broadcast

^4 PDSYEVX uses $17n\log(\sqrt{p})$ messages, HJS uses $9n\log(\sqrt{p})$; we will show that this can be reduced to $5n\log(\sqrt{p})$
but do not claim that this is minimal.

5PDSYEVX  sends  (5  logfy/p)  +  2)  n2/^/p  elements  per  processor  and  HJS  reduces  this  to  ( \  log(y^i)  + 
V )  n~ / \/P  elements  per  processor.  The  design  I  suggest  requires  ( §  log(  yT)  +  elements  per  processor 

but  requires  fewer  messages. 

^6 Statically balancing the number of eigenvectors assigned to each processor column will reduce load
imbalance in back transformation. Using a smaller block size will reduce load imbalance in reduction to
tridiagonal form.

to all processors in the processor column; this requires $2\lg(p_r)\alpha$ execution time. In HJS,
they sum the squares of the elements and broadcast $\alpha = A(i+1, i)$ at the same time by
summing an additional value in the reduction. All processors except for the processor that
owns $A(i+1, i)$ contribute 0 to the sum while the processor owning $A(i+1, i)$ contributes
$A(i+1, i)$.

In the new PDSYEVX, we will employ this trick to broadcast $\alpha$ at the same time as the norm is computed. It is slightly more complicated because norm computations do not preserve the sign of a value. Hence, we compute two norms, of $\max(0,\alpha)$ and $\max(0,-\alpha)$, from which $\alpha$ is easily recovered as their difference. Ideally, we need a new PBLAS or BLACS routine which would simultaneously compute a norm and broadcast both it and other values.
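The following is a minimal sketch of the two ideas above, not the pdnrm2 source: a scaled sum of squares that never forms an element's square directly, and the recovery of a signed value from two non-negative norms. The function names, the list-of-lists stand-in for the processors of one processor column, and the `owner` argument are all assumptions made for illustration.

```python
import math

def scaled_sumsq(values):
    """Return (scale, ssq) such that sum(x*x) == scale**2 * ssq."""
    scale, ssq = 0.0, 1.0
    for x in values:
        a = abs(x)
        if a == 0.0:
            continue
        if a > scale:
            r = scale / a
            ssq = 1.0 + ssq * r * r   # rescale the running sum to the new maximum
            scale = a
        else:
            r = a / scale
            ssq += r * r
    return scale, ssq

def combine(left, right):
    """Reduction operator: merge two (scale, ssq) pairs."""
    (s1, q1), (s2, q2) = left, right
    if s1 < s2:
        s1, q1, s2, q2 = s2, q2, s1, q1
    if s1 == 0.0:
        return 0.0, 1.0
    r = s2 / s1
    return s1, q1 + q2 * r * r

def column_norm(local_parts):
    """2-norm of a vector whose pieces live on different (simulated) processors."""
    total = (0.0, 1.0)
    for part in local_parts:
        total = combine(total, scaled_sumsq(part))
    scale, ssq = total
    return scale * math.sqrt(ssq)

def norm_and_alpha(local_parts, owner, alpha):
    """Compute the norm and recover alpha from two extra non-negative norms.

    Only the owning processor contributes max(0, alpha) and max(0, -alpha);
    every other processor contributes 0 to both.
    """
    nprocs = len(local_parts)
    pos = [[max(0.0, alpha)] if r == owner else [0.0] for r in range(nprocs)]
    neg = [[max(0.0, -alpha)] if r == owner else [0.0] for r in range(nprocs)]
    return column_norm(local_parts), column_norm(pos) - column_norm(neg)
```

In a real code the three norms would travel in one combined reduction; here they are computed separately only to keep the sketch short.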


8.4.2  Reducing  communications  costs 

Communications costs can be reduced in both reduction to tridiagonal form and back transformation, but by vastly different methods. PDSYTRD, ScaLAPACK’s reduction to tridiagonal form code, will use a cyclic data layout on a square processor grid to simplify the code, allowing PDSYEVX to use the techniques demonstrated by Hendrickson, Jessup and Smith[91]: direct transpose, a column replicated/row distributed data layout for intermediate matrices, and combining messages. In addition, PDSYTRD will delay the last operation in the loop to combine it with the first, reducing the number of messages per loop iteration from 6 to 5.

Communication costs will be reduced in back transformation by using a rectangular grid and a relatively large block size. Most of the communication in back transformation is within processor columns, and the communication within processor columns cannot be pipelined (meaning that it grows as $\log(p_r)$). Hence, setting $p_c$ to be substantially larger (roughly 4-8 times larger) than $p_r$ will cut message volume nearly in half compared to the message volume required for a square processor grid.

Communications cost could be reduced further on select computers by writing machine-specific BLACS implementations7, but I don’t think that the benefit will justify the

7Karp et al.[107] proved that a broadcast or reduction of k elements on $p_x$ processors can be executed in $\log(p_x)\,\alpha + k\,\beta$. Equally importantly, the latency term can be reduced significantly by machine-specific code because latency is primarily a software cost; the actual hardware latency is typically less than one tenth of the total observed latency. I believe that by coding broadcasts and reductions in a machine-specific manner, I could reduce the latency to $\alpha_{\mathrm{software}} + \log(p_x)\,\alpha_{\mathrm{hardware}}$. It might be possible to achieve a similar result using active messages. Machine-specific optimization of the BLACS broadcast and reduction codes is attractive because it would benefit all of the ScaLAPACK matrix transformation codes. However,

cost. In PDSYEVX as shipped in version 1.5 of ScaLAPACK, software overhead and load imbalance are roughly twice as high as communications cost on the PARAGON. The new PDSYEVX should reduce communications by at least a factor of 2, and though I hope it will reduce software overhead and load imbalance by close to a factor of 4, overhead and load imbalance will probably remain larger than communications cost. The fact that communication cost is not the dominant factor limiting efficiency limits the improvement that one can expect from machine-specific BLACS implementations.

Communications cost in back transformation could be reduced further by overlapping communication and computation and/or using an all-to-all broadcast pattern instead of a series of broadcasts. Back transformation enjoys the luxury of being able to compute the majority of what it needs to communicate in advance. This allows many possibilities for reducing the communications bandwidth cost. The fact that message latency, load imbalance and software overhead costs are modest in back transformation means that a reduction in the communications bandwidth cost ought to result in significant performance improvement in back transformation. However, overlapping communication and computation has historically offered less benefit in practice than in theory (see section B.1.6), so I approach this with caution and will not pursue it without first convincing myself that the benefit is significant on several platforms.

8.4.3  Reducing  load  imbalance  costs 

Load imbalance can be reduced in both reduction to tridiagonal form and back transformation by careful selection of the block size. The number of messages in reduction to tridiagonal form is not dependent on the data layout block size, hence a cyclic data layout (i.e. block size of 1) will be used, reducing load imbalance. The fact that only half of the flops in reduction to tridiagonal form are BLAS3 flops and the large number of load imbalanced row operations combine to make the optimal algorithmic block size for reduction to tridiagonal form small.

Load imbalance is minimized in back transformation by choosing a block size which assigns a nearly equal number of eigenvectors to each column of processors ($nb = \lceil n/(k\,p_c) \rceil$ for some small integer k). A block cyclic data layout reduces execution time in back transformation by reducing the number of messages sent, hence we must look for


purely from the point of view of improving the performance of the ScaLAPACK symmetric eigensolver, this would probably not be worth the effort.

other ways to reduce load imbalance. Fortunately, all eigenvectors must be updated at each step, hence a good static load balance of eigenvectors across processor columns eliminates most of the load imbalance in back transformation. The load imbalance within each column of processors is less important because the number of processor rows will be small. The computation of T can be performed simultaneously on all processor columns, eliminating the load imbalance in that step.
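To make the block size rule concrete, the sketch below counts how many matrix columns (eigenvectors) each processor column owns under a block cyclic layout, and compares an arbitrary block size with $nb = \lceil n/(k\,p_c) \rceil$. The function names and the example sizes are assumptions chosen for illustration; they are not taken from ScaLAPACK.

```python
import math

def columns_per_processor_column(n, nb, pc):
    """Count the columns owned by each of pc processor columns (block cyclic)."""
    counts = [0] * pc
    for block, start in enumerate(range(0, n, nb)):
        counts[block % pc] += min(nb, n - start)   # the last block may be short
    return counts

def balanced_block_size(n, pc, k=1):
    """Block size nb = ceil(n / (k * pc)) for a small integer k."""
    return math.ceil(n / (k * pc))

if __name__ == "__main__":
    n, pc = 1000, 8
    print(columns_per_processor_column(n, 32, pc))
    print(columns_per_processor_column(n, balanced_block_size(n, pc), pc))
```

With n = 1000 and 8 processor columns, a block size of 32 leaves seven processor columns with 128 eigenvectors and one with 104, whereas the balanced block size of 125 gives every processor column exactly 125.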

8.4.4  Reducing  software  overhead  costs 

There are many ways to reduce software overhead, but software overhead is poorly understood and hence it is hard to predict which method will be best. Hendrickson, Jessup and Smith[91] showed that using a cyclic data layout and a square processor grid reduces the number of DTRMV calls from $O(n^2/nb)$ to $O(n)$ because each local matrix is triangular. Using lightweight (no error checking, minimal overhead) BLAS would reduce software overhead, but these are still in the planning stages. If the compiler produces efficient code for a simple doubly nested loop, software overhead can be further reduced by using compiled code instead of calls to the BLAS. Peter Strazdins has shown that software overhead within the PBLAS can be reduced by up to 50%[161, 160]. Alternatively, eliminating the PBLAS entirely would eliminate the overhead associated with the PBLAS. I would prefer to reduce the PBLAS overhead and continue to use the PBLAS, but that is likely to be much harder than simply abandoning the PBLAS.

When PDSYTRD, ScaLAPACK’s reduction to tridiagonal form, was written, the PBLAS did not support column-replicated/row-distributed matrices or algorithmic blocking. Hence, many of the ideas mentioned here for improving the performance of PDSYTRD were not available to a PBLAS-based code. PBLAS version 2 now offers these capabilities.

Software overhead cannot be measured separately from other costs and is hence difficult to measure, understand and reason about. It varies widely from machine to machine and can change just by changing the order in which subroutines are linked. We do not, for example, know how much can be attributed to subroutine calls, how much is caused by error checking, how much is caused by loop unrolling and how much is caused by code cache misses.

A good compiler should be able to compute the local portion of Av faster than two calls to DTRMV because a simple doubly nested loop could access each element in the local portion of A only once, whereas two calls to DTRMV would require that each element in A be read twice. The result is that the ratio of flops to main memory reads is 4-to-1 in the doubly nested loop versus 2-to-1 in DTRMV8. Furthermore, a compiled kernel would avoid the BLAS overhead and might involve less loop unrolling, reducing overhead directly and reducing code cache pressure as well. However, compiler technology is uneven, so we would make using compiled code instead of the BLAS optional.
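The doubly nested loop referred to above can be sketched as follows. This is an illustration of the access pattern only, written in Python rather than as a compiled kernel; the function name and the row-list storage of the lower triangle are assumptions. Each off-diagonal element is read once and used for two multiply-add pairs, which is where the 4-to-1 ratio of flops to reads comes from.

```python
def symv_lower_single_pass(a_lower, v):
    """Compute y = A v for symmetric A, reading each stored element once.

    a_lower[i] holds row i of the lower triangle: A(i, 0) ... A(i, i).
    """
    n = len(v)
    y = [0.0] * n
    for i in range(n):
        row = a_lower[i]
        acc = row[i] * v[i]            # diagonal element
        for j in range(i):
            a = row[j]                 # the only read of A(i, j)
            acc += a * v[j]            # A(i, j) contributes to y[i]
            y[j] += a * v[i]           # A(j, i) = A(i, j) contributes to y[j]
        y[i] += acc
    return y
```

Two triangular matrix-vector products would traverse the same storage twice to obtain the same result.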

Unblocked reduction to tridiagonal form will likely be faster than blocked reduction to tridiagonal form on problem sizes where software overhead is the dominant cost. Unblocked reduction to tridiagonal form on a cyclic data layout eliminates load imbalance and requires a minimum of communication and software overhead. The only disadvantage is that all of the $\frac{4}{3}n^3$ flops are BLAS2 flops. However, with a good compiler, these BLAS2 flops can perform well on most computers. The kernel in an unblocked reduction to tridiagonal form involves 8 flops for each read-modify-write memory access9. Most computers have adequate main memory bandwidth to handle this at full speed. However, not all compilers are good enough yet.
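As a consistency check on the two figures quoted above, the sketch below counts flops under one stated assumption: at each step the kernel touches every element of the lower triangle of the trailing matrix once, at 8 flops per element (for instance 4 for the rank-2 update and 4 for the symmetric matrix-vector product). That decomposition is an assumption made for this illustration; the function name is hypothetical.

```python
def unblocked_reduction_flops(n, flops_per_element=8):
    """Total flops if each step updates the lower triangle of the trailing matrix."""
    total = 0
    for m in range(1, n):                 # order of the trailing matrix at each step
        elements = m * (m + 1) // 2       # lower triangle, diagonal included
        total += flops_per_element * elements
    return total

if __name__ == "__main__":
    n = 300
    print(unblocked_reduction_flops(n) / (4.0 / 3.0 * n ** 3))
```

Under that assumption the total is $\frac{4}{3}(n^3 - n)$, which agrees with the $\frac{4}{3}n^3$ leading term.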

8.5 Separating internal and external data layout without increasing memory usage

Separating internal and external data layout will require memory-intensive data redistribution, but making the data redistribution codes more space efficient will save enough memory space to offset the memory needs of separating internal and external data layout. Data redistributions between two data layouts with different values of $p_r$, $p_c$ or nb use messages of $O(n^2/p^{3/2} + nb^2)$ data elements. However, degenerate data redistributions between two data layouts with the same values of $p_r$, $p_c$ or nb use messages of roughly $n^2/p$ elements. In order to avoid treating degenerate data redistributions separately, the current redistribution codes require $n^2/p$ buffer space for all redistributions. Splitting one large message into several smaller ones is not conceptually difficult but will require that the code be rewritten, and the testing will have to be augmented to properly exercise the new paths. However, the execution time will not be significantly affected. Both PDLARED2D, the

8These ratios are 8-to-1 and 4-to-1 respectively for Hermitian matrices.

9The ratio for reducing Hermitian matrices to tridiagonal form is 16 flops per read-modify-write operation.

eigenvector redistribution routine, and DGMR2D, the general purpose redistribution routine, will have to be modified.

If the redistribution routines are not modified as described above, memory usage would increase from $4n^2/p$ to $6n^2/p$ and would run a remote risk of causing the eigensolver to crash. While both PDLARED2D and DGMR2D require $n^2/p$ space and could use the same space, they do not. PDLARED2D uses space passed to it in the WORK array, while DGMR2D calls malloc to allocate space. The eigensolver could crash if a message of $n^2/p$ elements were sent and the communication system was unable to allocate a buffer of that size. Messages of that size are not required during normal ScaLAPACK eigensolver tests, hence the eigensolver could crash during regular use even after passing all tests and after months or even years of flawless service. Modifying the redistribution routines as we propose eliminates this potential problem.
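The idea of splitting one large message into several bounded ones can be sketched as below. This is only an illustration of the buffering pattern and is not the PDLARED2D or DGMR2D code; the function names and the `max_elements` parameter are hypothetical.

```python
def send_in_chunks(data, max_elements):
    """Yield successive messages holding at most max_elements elements each."""
    if max_elements < 1:
        raise ValueError("max_elements must be at least 1")
    for start in range(0, len(data), max_elements):
        yield data[start:start + max_elements]

def receive_chunks(messages):
    """Reassemble the original data from the sequence of messages."""
    received = []
    for message in messages:
        received.extend(message)
    return received
```

The buffer needed at any moment is bounded by `max_elements` rather than by the full $n^2/p$ local array, at the cost of sending more messages.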

Memory needs could be reduced from $4n^2/p$ to $3n^2/p$ by using the space allocated to the input matrix, A, and the output matrix, Z, as internal workspace. This would require a modification to the present calling sequence, probably in the form of a new data descriptor. However, reducing memory usage by 25% may not justify a change to the calling sequence.

Chapter 9

Advice to symmetric eigensolver users


Parallel dense tridiagonal eigensolvers should be used if none of the following contraindications hold. Use a serial eigensolver if the problem is small enough to fit1. Use a sparse eigensolver if your input matrix is sparse2 and you don’t need all the eigenvalues, or if the matrix is dense and you only need a small fraction of the eigenvalues. Use a Jacobi eigensolver if you need to compute small eigenvalues of a scaled diagonally dominant matrix (or a matrix satisfying one of the other properties described by Demmel et al.[56]) accurately. Use a Jacobi eigensolver for small ($n < 100\sqrt{p}$) spectrally diagonally dominant matrices3.

Currently  the  three  most  readily  available  parallel  dense  symmetric  eigensolvers 
are  PeIGs  and  ScaLAPACK’s  PDSYEV  and  PDSYEVX.  PeIGs  and  PDSYEV  maintain  orthogonality 
among  eigenvectors  associated  with  clustered  eigenvalues.  PeIGs  and  PDSYEVX  are  faster 
than  PDSYEV.  PDSYEVX  scales  better  than  either  PeIGs  or  PDSYEV. 

The choice between PeIGs and ScaLAPACK is probably more a matter of which infrastructure4 is preferred and is outside the scope of this thesis. Furthermore, it is likely that PeIGs will at some point use the ScaLAPACK symmetric eigensolver. Hence,

1i.e. if memory allows.

2The break-even point is not known, so I suggest that if your matrix is less than 10% non-zero and you need less than 10% of the eigenvalues you should use a sparse eigensolver.

3Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally dominant.

4PeIGs is built on top of Global Arrays[101] while ScaLAPACK is built on the BLACS or MPI.

the upgrade path for both may end up with the same underlying code. If you are not likely to use more than 32 processors, PeIGs performance should be acceptable5. If your input matrices do not include large clusters of eigenvalues or if you can accept non-orthogonal eigenvectors, PDSYEVX is the right choice. Otherwise, i.e. if your input matrix has large clusters of eigenvalues for which you need orthogonal eigenvectors, and you wish to use more than 32 processors, PDSYEV is the right choice. Eventually, the improved version of PDSYEVX described in Chapter 8 will be the method of choice in all cases.
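The advice in this chapter can be summarized as a small decision helper. The sketch below is one possible reading of it: the field names are hypothetical, the yes/no inputs stand in for judgments the text leaves to the user (for example what counts as a "small fraction" of the eigenvalues), and the order in which the rules are tried is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    fits_on_one_processor: bool = False
    sparse_and_partial_spectrum: bool = False       # sparse input, not all eigenvalues needed
    dense_but_few_eigenvalues: bool = False         # dense input, small fraction needed
    accurate_small_eigenvalues_sdd: bool = False    # scaled diagonally dominant input
    small_spectrally_diag_dominant: bool = False    # n < 100 * sqrt(p)
    large_clusters: bool = False
    needs_orthogonal_eigenvectors: bool = True
    processors: int = 64

def recommend(problem: Problem) -> str:
    """Return the eigensolver suggested by the rules of this chapter."""
    if problem.fits_on_one_processor:
        return "serial eigensolver"
    if problem.sparse_and_partial_spectrum or problem.dense_but_few_eigenvalues:
        return "sparse eigensolver"
    if problem.accurate_small_eigenvalues_sdd or problem.small_spectrally_diag_dominant:
        return "Jacobi eigensolver"
    if not problem.large_clusters or not problem.needs_orthogonal_eigenvectors:
        return "PDSYEVX"
    if problem.processors <= 32:
        return "PeIGs"
    return "PDSYEV"
```

For example, `recommend(Problem(large_clusters=True, processors=128))` returns `"PDSYEV"`, while the default `Problem()` returns `"PDSYEVX"`.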


5Since PeIGs uses a 1D data layout, its performance will degrade if you use more than 32 processors.

Variables and abbreviations

Table A.1: Variable names and their uses

Name | Meaning
(a,b) | The processor in processor row a and processor column b.
A | The input matrix (partially reduced).
A(i,j) | The i,j element in the (partially reduced) matrix A.
c | The number of eigenvalues in the largest cluster of eigenvalues.
C | The set of all processor columns.
$c_a$ | The current processor column within the sub-grid.
$c_b$ | The current processor column sub-grid.
e | The number of eigenvalues required.
j | The current column, A(j:n, j:n) being the un-reduced portion of the matrix.
j' | The column within the current block column, j' = mod(j, nb).
$\lg(\sqrt{p})$ | $\log_2 \sqrt{p}$
m | The number of eigenvectors required.
mb | The row block size. Used only when we discuss rectangular blocks. In general, the row block size and column block size are assumed to be equal and are written as nb.
mullen | A compile time parameter in the PBLAS which controls the panel size used in the PBLAS symmetric matrix vector multiply routine, PDSYMV.
n | The size of the input matrix A.
nb | The blocking factor. In PDSYEVX the data layout and algorithmic blocking factor are the same. In HJS the data layout blocking factor is 1 and nb refers to the algorithmic blocking factor.
p | The number of processors used in the computation.
pbf | Panel blocking factor. The panel width used in DGEMV in PDSYEVX and DGEMM in PDSYEVX and HJS is pbf × nb.
$p_r$ | The number of processor rows in the process grid.
$p_{r1}$ | The number of processor rows in a sub-grid.
$p_{r2}$ | The number of processor sub-grid rows.
$p_c$ | The number of processor columns in the process grid.
$p_{c1}$ | The number of processor columns in a sub-grid.
$p_{c2}$ | The number of processor sub-grid columns.
R | The set of all processor rows.
$r_a$ | The current processor row within the sub-grid.
$r_b$ | The current processor row sub-grid.
spread | In a “spread across”, every processor in the current processor column broadcasts to every other processor in the same processor row. In a “spread down”, every processor in the current processor row broadcasts to every other processor in the same processor column.
tril(A, 0) | The lower triangular part, including the diagonal, of the un-reduced part of the input matrix A, i.e. A(j:n, j:n).
tril(A, -1) | The lower triangular part, excluding the diagonal, of the un-reduced part of the input matrix A, i.e. A(j:n, j:n).

Table A.2: Variable names and their uses (continued)

Name | Meaning
v | The vector portion of the Householder reflector.
V | The current column of Householder reflectors. Size: n - j + j' by j'.
V(j-j':n, 1:j') | The current column of Householder reflectors. Size: n - j + j' by j'.
vnb | The imbalance in the 2D block-cyclic distribution of the eigenvector matrix.
w | The companion update vector, i.e. the vector used in $A = A - v w^T - w v^T$ to reduce A.
W | The current column of companion update vectors. Size: n - j + j' by j'.
W(j-j':n, 1:j') | The current column of companion update vectors. Size: n - j + j' by j'.

Table A.3: Abbreviations

Abbreviation | Meaning
CPU | Central Processing Unit
FPU | Floating Point Unit


Table A.4: Model costs

Symbol | Meaning | Terms included
$\alpha$ | The message initiation cost for BLACS send and receive. | $n\lg(p)$, $n$
$\beta$ | The inverse bandwidth cost for BLACS send and receive. | $n^2\lg(p)/\sqrt{p}$, $n^2/\sqrt{p}$, $n \times nb\,\lg(p)$
$\delta_3$ | DGEMM (matrix-matrix multiply) subroutine overhead plus the time penalty associated with invoking DGEMM on small matrices. | $n^2/(nb^2 \times pbf)$, $n/nb$
$\gamma_3$ | Time required per DGEMM (matrix-matrix multiply) flop. | $n^3/p$, $n^2 \times nb/\sqrt{p}$
$\delta_2$ | DGEMV (matrix-vector multiply) subroutine overhead plus the time penalty associated with invoking DGEMV on small matrices. | $n$
$\gamma_2$ | Time required per DGEMV (matrix-vector multiply) flop. | $n^3/p$, $n^2 \times nb/\sqrt{p}$
$\gamma_{\div}$ | Time required per divide. |
$\gamma_{\surd}$ | Time required per square root. |
$\gamma_1$ | Time required per BLAS1 (scalar-vector) flop. |
$\delta_1$ | Subroutine overhead for BLAS1 and similar codes. | $n^2$
$\delta_4$ | Subroutine overhead for the PBLAS. | $n$


Appendix  B 

Further  details 

B.1 Updating w during reduction to tridiagonal form

Line 4.1, $w = w - W V^T v - V W^T v$, in Figure 8.3 can be computed with minimal communication, minimal computation, or with an intermediate amount of both communication and computation. Indeed, Line 4.1 can be computed with $O\!\left(\frac{n^2\,nb}{p^{r}}\,\gamma_2 + n\log(p^{r-0.5})\,\alpha\right)$ cost for various $r \in [0.5, 1.0]$. $r = 1.0$ corresponds to the minimal computation cost option (discussed in section B.1.3), while $r = 0.5$ corresponds to the minimal (zero) communication cost option (discussed in section B.1.2). Section B.1.4 describes the intermediate options in a generalized form which includes both the minimum communication and minimum computation options as special cases.

The plethora of options for the update of w stems from the fact that the input matrices $W$, $V$, $W^T$ and $V^T$ are replicated across the relevant processors, while the input/output vector w is stored as partial sums across the processor columns in each of the processor rows. The input matrices are replicated because they will need to be replicated later to update A. The vector w is stored as partial sums because that is how it is initially computed, and because the combine operation used to compute w from the partial sums has not been performed at this point.

Throughout this section we only discuss computing $W V^T v$. $V W^T v$ can be computed in a similar manner. Moreover, the two computations, and all associated communication, can be merged to reduce software overhead and message latency costs.

B.1.1 Notation

In describing most parallel linear algebra codes, including all codes in this thesis outside of this appendix, we need not explicitly state the processor on which a value is stored. $A_{i,j}$ is understood to live on the processor that owns row i and column j. The nb'-element array tmp contains different values on different processors. Therefore, for the discussion in this appendix, an additional subscript is added to tmp to indicate the processor column. Furthermore, some entries in tmp are left undefined at various stages; therefore we use $j \in \{c_a\}$ to indicate all columns j owned by processor column $c_a$, i.e. $\mathrm{tmp}_{j \in \{c_a\}, c_a} = val$ means that $\forall j \in \{c_a\}$, $\mathrm{tmp}_j$ on processor $c_a$ is assigned val. For extra clarity within a display we write this as $\mathrm{tmp}_{\substack{j \in \{c_a\} \\ c_a}}$.


B.1.2 Updating w without added communication

Line 4.1, $w = w - W V^T v - V W^T v$, in Figure 8.3 can be computed without any communication other than that needed to compute w without the update. It initially appears that $w = w - W \cdot V^T v - V \cdot W^T v$ requires communication because computing $tmp = V^T v$ requires summing nb' values1 within each processor column, and computing $w = w - W \cdot tmp$ requires that tmp be broadcast within each processor column. However, $W \cdot V^T v$ can be computed with a single sum within each processor row and, by delaying the sum needed to compute w and merging the two, one of the two sums can be avoided completely. Figure B.1 derives how $W \cdot V^T v$ can be computed with a single sum within each processor row.

Line  3  The  transformation  from  line  2  to  line  3  is  the  standard  way  that  a  matrix  vector 
multiply  is  performed  in  parallel.  The  leftmost  sum  is  the  local  portion,  the  middle 
sum  is  the  sum  over  all  processors  in  the  processor  column. 

Line  4  Delay  the  sum  over  all  processors  in  the  processor  column  until  after  multiplying 
by  W.  The  rightmost  two  sums  involve  only  local  values. 

Figure  B.2  shows  how  to  compute  W  ■  VT v  without  added  communication. 
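The delayed sum can be checked with a small serial simulation. The sketch below is an illustration under stated assumptions, not the thesis code: processor columns are simulated by a loop, ownership of the index k is assumed cyclic, and the function names are hypothetical. Each simulated processor column forms its partial $V^T v$ from the entries it owns, multiplies by W immediately, and only then are the results summed.

```python
def delayed_sum_update(W, V, v, pc):
    """Compute W (V^T v) with one sum over pc simulated processor columns."""
    n, nbp = len(V), len(V[0])
    partial_results = []
    for c in range(pc):
        owned = range(c, n, pc)                       # assumed cyclic ownership of k
        tmp_c = [sum(V[k][j] * v[k] for k in owned) for j in range(nbp)]
        res_c = [sum(W[i][j] * tmp_c[j] for j in range(nbp)) for i in range(n)]
        partial_results.append(res_c)
    # The only sum across processor columns; in the parallel code it is merged
    # with the sum that is already needed to form w.
    return [sum(res[i] for res in partial_results) for i in range(n)]

def direct_update(W, V, v):
    """Reference: form tmp = V^T v completely, then multiply by W."""
    n, nbp = len(V), len(V[0])
    tmp = [sum(V[k][j] * v[k] for k in range(n)) for j in range(nbp)]
    return [sum(W[i][j] * tmp[j] for j in range(nbp)) for i in range(n)]
```

Because matrix multiplication distributes over the sum of the partial tmp vectors, both functions return the same vector.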


Line  5  Local  computation  of  VT  ■  v.  Operations: 

n  nb 

E  E2^nb'^ 

i=r,nb  nb'=r  ’ 

1nb/  =  i  —  ii  —  1  is  the  number  of  columns  in  H 


1  2  nb 

\  n  —  72 
Pr 
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Figure  B.l:  Avoiding  communication  in  computing  W  ■  VT v 


tmp 

£ 

II 

(Line  1) 

tmp; 

—  L!  L/ 

(Line  2) 

l<j<nb/  k£{C} 

tmp; 

=  L/  L/  Hkjhk 

(Line  3) 

l<j<nb'  1  <R<pr 

ke{  c} 

tmp; 

=  L!  L/  Hkjhk 

(Line  4) 

cepe  ke{c} 

l<j<nb/ 


Line  6  Local  computation  ofW  -tmp.  Operations: 

n  nb 

Z  z4"b'^  =  i»2 


nb 


i=l, nb  nb'=l 


Pr 


2  "  72 

Pr 


Line 7 Effect of summing $\mathrm{res}_i$ within each processor row. This operation is merged with the unavoidable summation of w within each processor row, hence this operation is not performed and has no cost.


B.1.3 Updating w with minimal computation cost

Figure  B.3  shows  how  W  ■  VTv  can  be  performed  with  only  0(-^=  +  )  com¬ 

putation  by  distributing  the  computation  of  tmp  =  VT  ■  v  and  w  =  w  +  W  ■  tmp  over  all 
the processors. Each of the nb columns of $V^T$ is assigned to one processor row, hence each processor row is assigned roughly $nb/\sqrt{p}$ columns of $V^T$. Each processor row computes the portion of $V^T \cdot v$ assigned to it, leaving the answer on the diagonal processor in this row. The diagonal processors then broadcast the $nb/\sqrt{p}$ elements of $V^T \cdot v$ which they own to all of the processors within their processor column. Finally, each processor computes $w = w + W \cdot tmp$ for the values of W and tmp which it owns.

Figure B.2: Computing $W \cdot V^T v$ without added communication

$\mathrm{tmp}_{j,C} = \sum_{k \in \{C\}} V_{k,j}\, v_k$ (Line 5)

$\mathrm{res}_{i \in \{R\},C} = \sum_{j} W_{i,j}\, \mathrm{tmp}_{j,C} = \sum_{j} \sum_{k \in \{C\}} W_{i,j}\, V_{k,j}\, v_k$ (Line 6)

$\sum_{C} \mathrm{res}_{i,C} = \sum_{1 \le C \le p_c} \sum_{j} \sum_{k \in \{C\}} W_{i,j}\, V_{k,j}\, v_k = (W \cdot V^T \cdot v)_i$ (Line 7)

Line 8 Local computation of $V^T \cdot v$. Operations:

$\sum_{i=1,nb}^{n} \sum_{nb'=1}^{nb} 2\,\frac{n-i}{p_r}\,\frac{nb'}{p_c}\,\gamma_2 = \frac{1}{2}\,\frac{n^2\,nb}{p}\,\gamma_2$

Line 9 Combine $\mathrm{tmp}_{j \in \{R\},C}$ within each processor column, leaving the answer on the diagonal processor. Operations:

$\sum_{i=1,nb}^{n} \sum_{nb'=1}^{nb} \log(p_c)\left(\alpha + \frac{nb'}{p_c}\,\beta\right) = n\log(p_c)\,\alpha + \frac{1}{2}\,\frac{n\,nb}{p_c}\,\log(p_c)\,\beta$

Line 10 Broadcast within each processor row from the diagonal processor. Operations:

$\sum_{i=1,nb}^{n} \sum_{nb'=1}^{nb} \log(p_c)\left(\alpha + \frac{nb'}{p_c}\,\beta\right) = n\log(p_c)\,\alpha + \frac{1}{2}\,\frac{n\,nb}{p_c}\,\log(p_c)\,\beta$

Line 11 Local computation of $W \cdot tmp$. Operations:

$\sum_{i=1,nb}^{n} \sum_{nb'=1}^{nb} 2\,\frac{n-i}{p_r}\,\frac{nb'}{p_c}\,\gamma_2 = \frac{1}{2}\,\frac{n^2\,nb}{p}\,\gamma_2$

Figure B.3: Computing $W \cdot V^T v$ with minimal computation

Line 12 Effect of summing $\mathrm{res}_i$ within each processor row. This operation is merged with the unavoidable summation of w within each processor row, hence this operation is not performed and has no cost.

The update of w in HJS requires similar communication and computation costs, although the patterns of communication are quite different. HJS uses recursive halving to spread the result of $tmp = V^T v$, computes $W \cdot tmp$ on all processors, and uses recursive doubling to compute w while simultaneously spreading it to all processor columns. Although the BLACS do not offer recursive halving and recursive doubling operations, we could build them out of BLACS sends and receives, but that would incur higher latency costs.

B.1.4  Updating  w  with  minimal  total  cost 

Line 4.1, $w = w - W V^T v - V W^T v$, in Figure 8.3 can be computed with $O\!\left(\frac{n^2\,nb}{p^{r}}\,\gamma_2 + n\log(p^{r-0.5})\,\alpha\right)$ cost for any $r \ge 0.5$. On a high latency machine, one can reduce the total number of messages by increasing the load imbalance. On a low latency machine, one can reduce the load imbalance by using more messages. The two options described in the preceding sections are special cases of the general method described in this section. Section B.1.2 corresponds to $r = 0.5$. Section B.1.3 corresponds to $r = 1.0$.

This method has not been implemented and hence has not been proven to result in decreased execution times in practice.
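The trade-off controlled by r can be explored numerically. The sketch below assumes that the cost model reads $\frac{n^2\,nb}{p^{r}}\gamma_2 + n\log_2(p^{r-0.5})\,\alpha$, with constant factors dropped; the function name and the machine parameters used in any call are hypothetical and serve only to show the shape of the trade-off.

```python
import math

def update_cost(n, nb, p, r, gamma2, alpha):
    """Return (computation, latency) terms of the assumed cost model for a given r."""
    if not 0.5 <= r <= 1.0:
        raise ValueError("r is expected to lie in [0.5, 1.0]")
    computation = n * n * nb / p ** r * gamma2
    latency = n * math.log2(p ** (r - 0.5)) * alpha
    return computation, latency

if __name__ == "__main__":
    # Hypothetical machine: 10 ns per DGEMV flop, 100 microseconds per message.
    for r in (0.5, 0.75, 1.0):
        print(r, update_cost(n=1000, nb=32, p=64, r=r, gamma2=1e-8, alpha=1e-4))
```

At r = 0.5 the latency term vanishes, and at r = 1.0 the computation term is smaller by a factor of $\sqrt{p}$, which matches the two special cases named in the text.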

Methods corresponding to $0.5 < r < 1.0$ require what amounts to a four dimensional processor grid. The $p_r \times p_c$ processor grid is divided into $p_{r2} \times p_{c2}$ sub-grids, with each sub-grid consisting of $p_{r1} \times p_{c1}$ processors. We restrict our attention to square processor grids and square processor sub-grids, hence $p_r = p_c$, $p_{r1} = p_{c1}$ and $p_{r2} = p_{c2}$. Each processor column is identified by a pair of numbers, $(c_a, c_b)$, s.t. $1 \le c_a \le p_{c1}$ and $1 \le c_b \le p_{c2}$. Likewise, each processor row is identified by a pair of numbers, $(r_a, r_b)$, s.t. $1 \le r_a \le p_{r1}$ and $1 \le r_b \le p_{r2}$. No modifications are needed to the BLACS to support this method because each processor belongs to only two 2 dimensional processor grids: the normal two dimensional data layout and a two dimensional data layout containing only those processors in the same processor sub-grid, i.e. with the same $r_b$ and $c_b$.

Figure B.4 shows the general method for updating w using a 4 dimensional data layout. The nb' elements of tmp are distributed over the $p_{r1}$ processor rows and columns within each processor block, such that each processor row and column owns roughly $nb'/p_{r1}$ elements of tmp.

Figure B.4: Computing $W \cdot V^T v$ on a four dimensional processor grid

B.1.5  Notes  to  figure  B.4 

Line  13  Local  computation  of  VT  ■  v.  Operations: 

n  nb  |  / 

E  E  -2~— 12  =  In2  nb /(pr  pcl )  72 

Line 14 Combine the pieces of tmp (indexed by $r_a$ and $c_a$) within each processor sub-grid column, leaving the answer on the diagonal processor (i.e. $r_a = c_a$) within each sub-grid. Operations:

$$\sum_{i=1,nb}^{n} \sum_{nb'=1}^{nb} \log(p_{r1}) \left( \alpha + \frac{nb'}{p_{c1}}\,\beta \right) = n \log(p_{c1})\,\alpha + \frac{1}{2}\,\frac{n\,nb}{p_{c1}} \log(p_{c1})\,\beta$$

(using $p_{r1} = p_{c1}$).


Line 15 Broadcast the pieces of tmp (indexed by $r_a$ and $c_a$) within each processor sub-grid row from the diagonal processor in that sub-grid row. Operations:

$$\sum_{i=1,nb}^{n} \sum_{nb'=1}^{nb} \log(p_{c1}) \left( \alpha + \frac{nb'}{p_{c1}}\,\beta \right) = n \log(p_{c1})\,\alpha + \frac{1}{2}\,\frac{n\,nb}{p_{c1}} \log(p_{c1})\,\beta$$


Line 16 Local computation of $W \cdot tmp$. Operations:

$$\sum_{i=1,nb}^{n} \sum_{nb'=1}^{nb} 2\,\frac{n-i}{p_r}\,\frac{nb'}{p_{c1}}\,\gamma_2 = \frac{1}{2}\,\frac{n^2\,nb}{p_r\,p_{c1}}\,\gamma_2$$


Line 17 Effect of summing res within each processor row. This operation is merged with the unavoidable summation of w within each processor row, hence this operation is not performed and has no cost.
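As a sanity check on the operation counts for the local computations (Lines 13 and 16), the double sum can be evaluated directly and compared with its closed form. The sketch below assumes that the per-step local cost is $2\,\frac{n-i}{p_r}\,\frac{nb'}{p_{c1}}$ flops, with $i$ stepping through the matrix one panel of width nb at a time; the function names are illustrative and the closed form keeps only the leading term, so the two values agree only approximately.

```python
def local_flops_exact(n, nb, pr, pc1):
    """Evaluate the double sum term by term (assumed summand, see lead-in)."""
    total = 0.0
    for i in range(1, n + 1, nb):        # i = 1, 1+nb, ...: one panel per step
        for nbp in range(1, nb + 1):     # nb' = 1..nb: columns within the panel
            total += 2.0 * (n - i) / pr * nbp / pc1
    return total


def local_flops_closed(n, nb, pr, pc1):
    """Leading-order closed form: (1/2) n^2 nb / (pr * pc1)."""
    return 0.5 * n * n * nb / (pr * pc1)


def message_steps(n, nb):
    """Number of (i, nb') pairs, i.e. how often the alpha term is charged."""
    return sum(1 for _ in range(1, n + 1, nb) for _ in range(nb))


n, nb, pr, pc1 = 1024, 32, 8, 2
ratio = local_flops_exact(n, nb, pr, pc1) / local_flops_closed(n, nb, pr, pc1)
print(ratio, message_steps(n, nb))
```

When nb divides n, the number of steps equals n, which is where the factor of n in front of the $\alpha$ terms for Lines 14 and 15 comes from.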


B.1.6  Overlap  communication  and  computation  as  a  last  resort 

There are numerous studies showing that overlapping communication and computation improves performance, but most of them show only modest improvement. Arbenz and Slapnicar [9] show a 5% improvement by overlapping communication and computation while Pourzandi and Tourancheau show a 6% improvement. Those that show the greatest improvement combine communication and computation overlap with other equally important techniques such as pipelining and lookahead [32].

I don't know why overlapping communication and computation leads to only modest improvements. In theory it ought to hide most of the communication costs. There are several possible explanations, all of which presumably contribute. I suspect that the most important reason for the disappointing savings from overlap is that overhead, and not communication cost, is the primary factor limiting efficiency. A second important reason is that most of the cost of communication on today's distributed memory machines is the cost of moving the data between the node and the network, not moving data within the network. Moving data to and from the node always consumes main memory cycles which, unless the main memory is dual ported (i.e. expensive), must be stolen from the execution of the rest of the code. Further, the latency cost is almost all software overhead, hence during the message setup the cpu is busy and cannot compute.
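A rough way to see this argument is to split communication time into a part that occupies the cpu (software overhead and stolen memory cycles) and a part that proceeds in the network. Only the second part can be hidden behind computation. The model and the numbers below are illustrative assumptions, not measurements from any of the cited studies.

```python
def time_without_overlap(compute, cpu_bound_comm, network_comm):
    """Everything is serialized: compute, then communicate."""
    return compute + cpu_bound_comm + network_comm


def time_with_overlap(compute, cpu_bound_comm, network_comm):
    """The cpu still pays for computation and for the cpu-bound part of
    communication; only the in-network transfer runs concurrently."""
    return max(compute + cpu_bound_comm, network_comm)


def improvement(compute, cpu_bound_comm, network_comm):
    """Fraction of the non-overlapped time saved by overlapping."""
    before = time_without_overlap(compute, cpu_bound_comm, network_comm)
    after = time_with_overlap(compute, cpu_bound_comm, network_comm)
    return 1.0 - after / before


# Hypothetical split: 80 units of computation, 15 units of communication
# work that keeps the cpu busy, 5 units of pure network transfer.
print(improvement(80.0, 15.0, 5.0))
```

Under this split, communication accounts for a fifth of the run time, yet overlap can recover only the part spent in the network.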

The disadvantage of communication and computation overlap is that it adds complexity which can be put to better use elsewhere. Both the Pourzandi/Tourancheau and Arbenz/Slapnicar studies used a 1D data layout in Jacobi although a 2D data layout offers lower communication costs, $O(n^2/\sqrt{p})$ versus $O(n^2)$, and lower overhead costs. They would have done better to use a 2D data layout and delayed (potentially forever) consideration of communication and computation overlap.


B.2  Matlab  codes 

B.2.1  Jacobi 

The  following  is  the  matlab  code  for  Table  7.4. 

n = 1000;
p = 64;

blacsalpha = 65.9e-6;
blacsbeta = .146e-6;
dividebeta = 3.85e-6;
squarerootbeta = 7.7e-6;
blasonebeta = .074e-6;
dgemmalpha = 103e-6;
dgemmbeta = .0215e-6;

term(1) = 8 * sqrt(p) * ( log2(p) - 3 ) * blacsalpha
term(2) = 7/2 * n^2 / sqrt(p) * blacsbeta
term(3) = 1/8 * n^2 / sqrt(p) * log2(p) * blacsbeta
term(4) = 1/2 * n^2 / sqrt(p) * dividebeta
term(5) = 1/4 * n^2 / sqrt(p) * squarerootbeta
term(6) = 3/8 * n^3 / p * blasonebeta
term(7) = 8 * sqrt(p) * dgemmalpha
term(8) = 5 * n^3 / p * dgemmbeta

time = sum(term)
Appendix C

Miscellaneous  matlab  codes 

C.1  Reduction to tridiagonal form

The  following  matlab  code  performs  an  unblocked  reduction  to  tridiagonal  form. 
It  produces  the  same  values,  up  to  roundoff,  of  D ,  E  and  TAU  as  LAPACK’s  DSYTRD  and 
ScaLAPACK’s  PDSYTRD. 


%
% tridi - An unblocked, non-symmetric reduction to tridiagonal form
%
% This file creates an input matrix A, reduces it to tridiagonal form
% and tests to make sure that the reduction was performed correctly.
%
% outputs:
%   D, E - The tridiagonal matrix
%   tau
%   A - The lower half holds the Householder updates
%

%
% Produce the input matrix
%
N = 7;
A = hilb(N) + toeplitz( [ 1 (1:(N-1))*i ] );
B = A;  % Keep a copy to check our work later.

%
% Reduce to tridiagonal form
%
n = size(A,1);
I = eye(N);
for j = 1:n-1
    %
    % Compute the Householder vector: v
    %
    clear v;
    v(1:n,1) = zeros(n,1);
    v(j+1:n,1) = A(j+1:n,j);
    alpha = A(j+1,j);
    beta = - norm(v) * real(alpha) / abs( real(alpha) );
    tau(j) = ( beta - alpha ) / beta;
    v = v / ( alpha - beta );
    v(j+1) = 1.0;

    %
    % Perform the matrix vector multiply:
    %
    w = A * v;

    %
    % Compute the companion update vector: w
    %
    w = tau(j) * w;
    c = w' * v;
    w = ( w - ( c * tau(j) / 2 ) * v );
    D(j) = A(j,j);
    E(j) = beta;

    %
    % Update the trailing matrix
    %
    A = A - v * w' - w * v';

    %
    % Store the Householder vector back into A
    %
    A(j+2:n,j) = v(j+2:n);
end

D(n) = A(n,n);
%
% Check to make sure that the reduction was performed correctly.
%
DE = diag(D) + diag(E,-1) + diag(E,1);
Q = I;
for j = 1:n-1
    clear house
    house(1:n,1) = zeros(n,1);
    house(j+1:n,1) = A(j+1:n,j);
    house(j+1,1) = 1.0;
    Q = (I - tau(j)' * house * house') * Q;
end

norm( B - Q' * DE * Q )
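
For readers without Matlab, the same unblocked update can be sketched in plain Python. This is an illustrative sketch, not a port of the listing above. It handles only real symmetric input, the function name `tridiagonalize` is hypothetical, and it stops one column earlier because a real matrix needs no reflector for its last sub-diagonal entry. Instead of rebuilding Q, it checks quantities that an orthogonal similarity transformation must preserve.

```python
import math


def tridiagonalize(a):
    """Reduce a real symmetric matrix (list of lists) to tridiagonal form.

    Returns (d, e, off): the diagonal, the sub-diagonal, and the largest
    absolute entry left outside the tridiagonal band.
    """
    n = len(a)
    a = [row[:] for row in a]              # work on a copy
    for j in range(n - 2):
        alpha = a[j + 1][j]
        norm = math.sqrt(sum(a[k][j] ** 2 for k in range(j + 1, n)))
        if norm == 0.0:
            continue                       # column is already reduced
        beta = -math.copysign(norm, alpha)
        tau = (beta - alpha) / beta
        # Householder vector: zero through row j, one at row j+1.
        v = [0.0] * n
        v[j + 1] = 1.0
        for k in range(j + 2, n):
            v[k] = a[k][j] / (alpha - beta)
        # Companion vector: w = tau*A*v - (tau/2) * (w'v) * v.
        w = [tau * sum(a[r][k] * v[k] for k in range(n)) for r in range(n)]
        c = sum(w[k] * v[k] for k in range(n))
        w = [w[k] - (c * tau / 2.0) * v[k] for k in range(n)]
        # Rank-two update of the whole matrix: A = A - v*w' - w*v'.
        for r in range(n):
            for s in range(n):
                a[r][s] -= v[r] * w[s] + w[r] * v[s]
    d = [a[k][k] for k in range(n)]
    e = [a[k + 1][k] for k in range(n - 1)]
    off = max((abs(a[r][s]) for r in range(n) for s in range(n)
               if abs(r - s) > 1), default=0.0)
    return d, e, off


# Example input: the 5 x 5 Hilbert matrix.
hilbert = [[1.0 / (r + s + 1) for s in range(5)] for r in range(5)]
d, e, off = tridiagonalize(hilbert)
print(d, e, off)
```

The trace and the Frobenius norm of the tridiagonal result should match those of the input up to roundoff, which gives a cheap correctness check that does not require accumulating the reflectors.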


